Machine Learning Crash Course



Introduction to Machine Learning

This module introduces Machine Learning (ML).


Framing

This module investigates how to frame a task as a machine learning problem, and covers many of the basic vocabulary terms shared across a wide range of machine learning (ML) methods.

What is (Supervised) Machine Learning?

ML systems learn how to combine input to produce useful predictions on never-before-seen data.

Terminology: Labels and Features

  • Label is the true thing we're predicting: y
    • The y variable in basic linear regression
  • Features are input variables describing our data: xi
    • The {x1, x2, ... xn} variables in basic linear regression

Terminology: Examples and Models

  • Example is a particular instance of data, x
  • Labeled example has {features, label}: (x, y)
    • Used to train the model
  • Unlabeled example has {features, ?}: (x, ?)
    • Used for making predictions on new data
  • Model maps examples to predicted labels: y'
    • Defined by internal parameters, which are learned

Framing: Key ML Terminology

What is (supervised) machine learning? Concisely put, it is the following:

ML systems learn how to combine input to produce useful predictions on never-before-seen data.

Let's explore fundamental machine learning terminology.

Labels

A label is the thing we're predicting—the y variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.

Features

A feature is an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as:

$$\{x_1, x_2, \ldots, x_N\}$$

In the spam detector example, the features could include the following:

  • words in the email text
  • sender's address
  • time of day the email was sent
  • email contains the phrase "one weird trick."

Examples

An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We break examples into two categories:

A labeled example includes both feature(s) and the label. That is:

  labeled examples: {features, label}: (x, y)

Use labeled examples to train the model. In our spam detector example, the labeled examples would be individual emails that users have explicitly marked as "spam" or "not spam."

For example, the following table shows 5 labeled examples from a data set containing information about housing prices in California:

housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature) | medianHouseValue (label)
15                         | 5612                 | 1283                    | 66900
19                         | 7650                 | 1901                    | 80100
17                         | 720                  | 174                     | 85700
14                         | 1501                 | 337                     | 73400
20                         | 1454                 | 326                     | 65500

An unlabeled example contains features but not the label. That is:

  unlabeled examples: {features, ?}: (x, ?)

Here are 3 unlabeled examples from the same housing dataset, which exclude medianHouseValue:

housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature)
42                         | 1686                 | 361
34                         | 1226                 | 180
33                         | 1077                 | 271

Once we've trained our model with labeled examples, we use that model to predict the label on unlabeled examples. In the spam detector, unlabeled examples are new emails that humans haven't yet labeled.

Models

A model defines the relationship between features and label. For example, a spam detection model might associate certain features strongly with "spam". Let's highlight two phases of a model's life:

  • Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
  • Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y').

Regression vs. classification

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:

  • What is the value of a house in California?
  • What is the probability that a user will click on this ad?

A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:

  • Is a given email message spam or not spam?
  • Is this an image of a dog, a cat, or a hamster?


Framing: Check Your Understanding

Supervised Learning

Explore the options below.

Suppose you want to develop a supervised machine learning model to predict whether a given email is "spam" or "not spam." Which of the following statements are true?
  • Emails not marked as "spam" or "not spam" are unlabeled examples.
    True. Because our label consists of the values "spam" and "not spam", any email not yet marked as spam or not spam is an unlabeled example.
  • Words in the subject header will make good labels.
    False. Words in the subject header might make excellent features, but they won't make good labels.
  • We'll use unlabeled examples to train the model.
    False. We'll use labeled examples to train the model. We can then run the trained model against unlabeled examples to infer whether the unlabeled email messages are spam or not spam.
  • The labels applied to some examples might be untrustworthy.
    True. The labels for this dataset probably come from email users who mark particular email messages as spam. Since very few users mark every suspicious email message as spam, we may have a hard time ever knowing whether an email is spam. Furthermore, some spammers or botnets could intentionally poison our model by providing faulty labels.

Features and Labels

Explore the options below.

Suppose an online shoe store wants to create a supervised ML model that will provide personalized shoe recommendations to users. That is, the model will recommend certain pairs of shoes to Marty and different pairs of shoes to Janet. Which of the following statements are true?
  • Shoe size is a useful feature.
    True. Shoe size is a quantifiable signal that likely has a strong impact on whether the user will like the recommended shoes. For example, if Marty wears size 9, the model shouldn't recommend size 7 shoes.
  • Shoe beauty is a useful feature.
    False. Good features are concrete and quantifiable. Beauty is too vague a concept to serve as a useful feature. Beauty is probably a blend of certain concrete features, such as style and color. Style and color would each be better features than beauty.
  • User clicks on a shoe's description is a useful label.
    True. Users probably only want to read more about those shoes that they like. User clicks is, therefore, an observable, quantifiable metric that could serve as a good training label.
  • The shoes that a user adores is a useful label.
    False. Adoration is not an observable, quantifiable metric. The best we can do is search for observable proxy metrics for adoration.

Descending into ML

Linear regression is a method for finding the straight line or hyperplane that best fits a set of points. This module explores linear regression intuitively before laying the groundwork for a machine learning approach to linear regression.

Learning From Data

  • There are lots of complex ways to learn from data
  • But we can start with something simple and familiar
  • Starting simple will open the door to some broadly useful methods

(Image: a model overfitting its data.)

A Convenient Loss Function for Regression

L2 Loss for a given example is also called squared error:

  = square of the difference between prediction and label
  = (observation - prediction)^2
  = (y - y')^2

(Figure: L2 loss as a function of predicted value, shown for target values 0.0 and 1.7.)

Defining L2 Loss on a Data Set

$$ L_2Loss = \sum_{(x,y)\in D} (y - prediction(x))^2 $$

where:

  • \(\sum\): we're summing over all examples \((x, y)\) in the training set.
  • \(D\): the data set. It's sometimes useful to average over all examples instead, dividing the sum by \(\|D\|\).


Descending into ML: Linear Regression

It has long been known that crickets (an insect species) chirp more frequently on hotter days than on cooler days. For decades, professional and amateur scientists have cataloged data on chirps-per-minute and temperature. As a birthday gift, your Aunt Ruth gives you her cricket database and asks you to learn a model to predict this relationship.

First, examine your data by plotting it:

(Scatter plot: cricket chirps per minute on the x-axis vs. temperature in Celsius on the y-axis.)

Figure 1. Chirps per Minute vs. Temperature in Celsius.

As expected, the plot shows the temperature rising with the number of chirps. Is this relationship between chirps and temperature linear? Yes, you could draw a single straight line like the following to approximate this relationship:

(The same scatter plot with a single straight line drawn through the points, establishing the relationship between chirps per minute and temperature.)

Figure 2. A linear relationship.

True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

$$ y = mx + b $$

where:

  • \(y\) is the temperature in Celsius (the value we're trying to predict).
  • \(m\) is the slope of the line.
  • \(x\) is the number of chirps per minute (the value of our input feature).
  • \(b\) is the y-intercept.

By convention in machine learning, you'll write the equation for a model slightly differently:

$$ y' = b + w_1x_1 $$

where:

  • \(y'\) is the predicted label (a desired output).
  • \(b\) is the bias (the y-intercept), sometimes referred to as \(w_0\).
  • \(w_1\) is the weight of feature 1. Weight is the same concept as the "slope" \(m\) in the traditional equation of a line.
  • \(x_1\) is a feature (a known input).

To infer (predict) the temperature \(y'\) for a new chirps-per-minute value \(x_1\), just substitute the \(x_1\) value into this model.

Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (\(w_1\), \(w_2\), etc.). For example, a model that relies on three features might look as follows:

$$y' = b + w_1x_1 + w_2x_2 + w_3x_3$$
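
To make the formula concrete, here's a minimal sketch (plain Python with NumPy; the weights, bias, and feature values below are made up for illustration) of how such a model computes a prediction:

import numpy as np

b = 50.0                         # bias (learned)
w = np.array([6.0, -2.5, 0.1])   # weights w1, w2, w3 (learned)
x = np.array([3.0, 1.0, 720.0])  # feature values x1, x2, x3 for one example

# y' = b + w1*x1 + w2*x2 + w3*x3
y_prime = b + np.dot(w, x)
print(y_prime)  # 137.5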

 


Descending into ML: Training and Loss

Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.

Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples. For example, Figure 3 shows a high loss model on the left and a low loss model on the right. Note the following about the figure:

Two Cartesian plots, each showing a line and some data points. In the first plot, the line is a terrible fit for the data, so the loss is high. In the second plot, the line is a better fit for the data, so the loss is low.

Figure 3. High loss in the left model; low loss in the right model.

 

Notice that the red arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the blue line in the right plot is a much better predictive model than the blue line in the left plot.

You might be wondering whether you could create a mathematical function—a loss function—that would aggregate the individual losses in a meaningful fashion.

Squared loss: a popular loss function

The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:

  = the square of the difference between the label and the prediction
  = (observation - prediction(x))^2
  = (y - y')^2

Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

$$ MSE = \frac{1}{N} \sum_{(x,y)\in D} (y - prediction(x))^2 $$

where:

  • \((x, y)\) is an example in which \(x\) is the set of features (for example, chirps per minute) that the model uses to make predictions, and \(y\) is the example's label (for example, temperature).
  • \(prediction(x)\) is a function of the weights and bias in combination with the set of features \(x\).
  • \(D\) is a data set containing many labeled examples, which are \((x, y)\) pairs.
  • \(N\) is the number of examples in \(D\).
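
To make the formula concrete, here's a minimal sketch (plain Python with NumPy; the labels and predictions are made up) of computing MSE:

import numpy as np

def mse(labels, predictions):
    # Mean squared error: the average of (y - y')^2 over all N examples.
    return np.mean((labels - predictions) ** 2)

y = np.array([66900.0, 80100.0, 85700.0])        # labels
y_prime = np.array([67000.0, 79000.0, 86000.0])  # model's predictions
print(mse(y, y_prime))  # average squared loss per example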

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.

 


Descending into ML: Check Your Understanding

Mean Squared Error

Consider the following two plots:

A plot of 10 points. A line runs through 6 of the points. 2 points are 1 "unit" above the line; 2 other points are 1 "unit" below the line. A plot of 10 points. A line runs through 8 of the points. 1 point is 2 "units" above the line; 1 other point is 2 "units" below the line.

Explore the options below.

Which of the two data sets shown in the preceding plots has the higher Mean Squared Error (MSE)?
  • The dataset on the left.
    Incorrect. The six examples on the line incur a total loss of 0. The four examples not on the line are not very far off the line, so even squaring their offset still yields a low value: $$ MSE = \frac{0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 0^2} {10} = 0.4$$
  • The dataset on the right.
    Correct. The eight examples on the line incur a total loss of 0. However, although only two points lie off the line, both of those points are twice as far off the line as the outlier points in the left figure. Squared loss amplifies those differences, so an offset of two incurs a loss four times greater than an offset of one: $$ MSE = \frac{0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} {10} = 0.8$$

Reducing Loss

To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.

How do we reduce loss?

  • Hyperparameters are the configuration settings used to tune how the model is trained.
  • Derivative of (y - y')^2 with respect to the weights and biases tells us how loss changes for a given example
    • Simple to compute and convex
  • So we repeatedly take small steps in the direction that minimizes loss
    • We call these Gradient Steps (But they're really negative Gradient Steps)
    • This strategy is called Gradient Descent

Block Diagram of Gradient Descent

(Diagram: the cycle of moving from features and label through the model (prediction function) to predictions, computing loss, and computing parameter updates.)


Weight Initialization

  • For convex problems, weights can start anywhere (say, all 0s)
    • Convex: think of a bowl shape
    • Just one minimum
  • Foreshadowing: not true for neural nets
    • Non-convex: think of an egg crate
    • More than one minimum
    • Strong dependency on initial values

(Images: a convex bowl-shaped graph and a graph with multiple local minima.)

SGD & Mini-Batch Gradient Descent

  • Could compute gradient over entire data set on each step, but this turns out to be unnecessary
  • Computing gradient on small data samples works well
    • On every step, get a new random sample
  • Stochastic Gradient Descent: one example at a time
  • Mini-Batch Gradient Descent: batches of 10-1000
    • Loss & gradients are averaged over the batch

Reducing Loss: An Iterative Approach

The previous module introduced the concept of loss. Here, in this module, you'll learn how a machine learning model iteratively reduces loss.

Iterative learning might remind you of the "Hot and Cold" kids' game for finding a hidden object like a thimble. In this game, the "hidden object" is the best possible model. You'll start with a wild guess ("The value of \(w_1\) is 0.") and wait for the system to tell you what the loss is. Then, you'll try another guess ("The value of \(w_1\) is 0.5.") and see what the loss is. Aah, you're getting warmer. Actually, if you play this game right, you'll usually be getting warmer. The real trick to the game is trying to find the best possible model as efficiently as possible.

The following figure suggests the iterative trial-and-error process that machine learning algorithms use to train a model:

(Diagram: the cycle of moving from features and label through the model (prediction function) to predictions, computing loss, and computing parameter updates.)

Figure 1. An iterative approach to training a model.

We'll use this same iterative approach throughout Machine Learning Crash Course, detailing various complications, particularly within that stormy cloud labeled "Model (Prediction Function)." Iterative strategies are prevalent in machine learning, primarily because they scale so well to large data sets.

The "model" takes one or more features as input and returns one prediction (y') as output. To simplify, consider a model that takes one feature and returns one prediction:

$$ y' = b + w_1x_1 $$

What initial values should we set for \(b\) and \(w_1\)? For linear regression problems, it turns out that the starting values aren't important. We could pick random values, but we'll just take the following trivial values instead:

  • \(b = 0\)
  • \(w_1 = 0\)

Suppose that the first feature value is 10. Plugging that feature value into the prediction function yields:

  y' = 0 + 0(10)
  y' = 0

The "Compute Loss" part of the diagram is the loss function that the model will use. Suppose we use the squared loss function. The loss function takes in two input values:

At last, we've reached the "Compute parameter updates" part of the diagram. It is here that the machine learning system examines the value of the loss function and generates new values for \(b\) and \(w_1\). For now, just assume that this mysterious box devises new values and then the machine learning system re-evaluates all those features against all those labels, yielding a new value for the loss function, which yields new parameter values. And the learning continues iterating until the algorithm discovers the model parameters with the lowest possible loss. Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.


Reducing Loss: Gradient Descent

The iterative approach diagram (Figure 1) contained a green hand-wavy box entitled "Compute parameter updates." We'll now replace that algorithmic fairy dust with something more substantial.

Suppose we had the time and the computing resources to calculate the loss for all possible values of \(w_1\). For the kind of regression problems we've been examining, the resulting plot of loss vs. \(w_1\) will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:

(Figure: a U-shaped convex plot of loss vs. the value of weight w1.)

Figure 2. Regression problems yield convex loss vs weight plots.

 

Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.

Calculating the loss function for every conceivable value of \(w_1\) over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in machine learning—called gradient descent.

The first stage in gradient descent is to pick a starting value (a starting point) for \(w_1\). The starting point doesn't matter much; therefore, many algorithms simply set \(w_1\) to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0:

(Figure: the same U-shaped loss curve, with a starting point marked slightly greater than 0.)

Figure 3. A starting point for gradient descent.

The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.

Note that a gradient is a vector, so it has both of the following characteristics:

  • a direction
  • a magnitude

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

(Figure: the same U-shaped loss curve; an arrow at the starting point shows the negative gradient.)

Figure 4. Gradient descent relies on negative gradients.

To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point as shown in the following figure:

(Figure: the same U-shaped loss curve; a gradient step moves from the starting point to a next point closer to the minimum.)

Figure 5. A gradient step moves us to the next point on the loss curve.

Gradient descent then repeats this process, edging ever closer to the minimum.
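
As a minimal sketch of the whole process (one weight, squared loss on a single made-up example; an illustration, not the course's implementation):

# Minimize squared loss (w1*x - y)^2 for a one-weight model y' = w1 * x.
x, y = 3.0, 12.0     # one made-up labeled example; the ideal w1 is 4.0
w1 = 0.0             # starting point
learning_rate = 0.05

for step in range(30):
    gradient = 2 * (w1 * x - y) * x     # derivative of the loss w.r.t. w1
    w1 = w1 - learning_rate * gradient  # step in the negative gradient direction

print(w1)  # converges toward 4.0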


Reducing Loss: Learning Rate

As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.

Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long:

(Figure: the same U-shaped curve; many points very close together make extremely slow progress toward the bottom of the U. A small learning rate takes forever!)

Figure 6. Learning rate is too small.

Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:

(Figure: the same U-shaped curve with very few points; the trail jumps clean across the bottom of the U and back again, overshooting the minimum.)

Figure 7. Learning rate is too large.

There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.

(Figure: the same U-shaped curve; the trail of points reaches the minimum in about eight steps. We'll get there efficiently.)

Figure 8. Learning rate is just right.


Optimizing Learning Rate

Exercise 1

Set a learning rate of 0.1 on the slider. Keep hitting the STEP button until the gradient descent algorithm reaches the minimum point of the loss curve. How many steps did it take?

Exercise 2

Can you reach the minimum more quickly with a higher learning rate? Set a learning rate of 1, and keep hitting STEP until gradient descent reaches the minimum. How many steps did it take this time?

Exercise 3

How about an even larger learning rate? Reset the graph, set a learning rate of 4, and try to reach the minimum of the loss curve. What happened this time?

Optional Challenge

Can you find the Goldilocks learning rate for this curve, where gradient descent reaches the minimum point in the fewest number of steps? What is the fewest number of steps required to reach the minimum?


Reducing Loss: Stochastic Gradient Descent

In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we've assumed that the batch has been the entire data set. When working at Google scale, data sets often contain billions or even hundreds of billions of examples. Furthermore, Google data sets often contain huge numbers of features. Consequently, a batch can be enormous. A very large batch may cause even a single iteration to take a very long time to compute.

A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.

What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.

Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.

To simplify the explanation, we focused on gradient descent for a single feature. Rest assured that gradient descent also works on feature sets that contain multiple features.
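
As an illustrative sketch (NumPy, synthetic data; the batch size and learning rate here are arbitrary choices), mini-batch SGD for a one-feature linear model might look like:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100_000)           # synthetic feature values
y = 4.0 * X + 2.0 + rng.normal(0, 1, 100_000)  # synthetic noisy labels

w1, b = 0.0, 0.0
learning_rate, batch_size = 0.01, 100

for step in range(1000):
    idx = rng.integers(0, len(X), size=batch_size)  # a new random sample each step
    error = (w1 * X[idx] + b) - y[idx]
    # Gradients of squared loss, averaged over the mini-batch.
    w1 -= learning_rate * np.mean(2 * error * X[idx])
    b -= learning_rate * np.mean(2 * error)

print(w1, b)  # approaches the true values 4.0 and 2.0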

 


Reducing Loss: Playground Exercise

Learning Rate and Convergence

This is the first of several Playground exercises. Playground is a program developed especially for this course to teach machine learning principles.

Each Playground exercise generates a dataset. The label for this dataset has two possible values. You could think of those two possible values as spam vs. not spam or perhaps healthy trees vs. sick trees. The goal of most exercises is to tweak various hyperparameters to build a model that successfully classifies (separates or distinguishes) one label value from the other. Note that most data sets contain a certain amount of noise that will make it impossible to successfully classify every example.

The interface for this exercise provides three buttons:

Name       | What it Does
Reset      | Resets Iterations to 0. Resets any weights the model had already learned.
Step       | Advances one iteration. With each iteration, the model changes, sometimes subtly and sometimes dramatically.
Regenerate | Generates a new data set. Does not reset Iterations.

In this first Playground exercise, you'll experiment with learning rate by performing two tasks.

Task 1: Notice the Learning rate menu at the top-right of Playground. The given Learning rate (3) is very high. Observe how that high Learning rate affects your model by clicking the "Step" button 10 or 20 times. After each early iteration, notice how the model visualization changes dramatically. You might even see some instability after the model appears to have converged. Also notice the lines running from x1 and x2 to the model visualization. The thickness of these lines reflects the weights of those features in the model. That is, a thick line indicates a high weight.

Task 2: Do the following:

  1. Press the Reset button.
  2. Lower the Learning rate.
  3. Press the Step button a bunch of times.

How did the lower learning rate impact convergence? Examine both the number of steps needed for the model to converge, and also how smoothly and steadily the model converges. Experiment with even lower values of learning rate. Can you find a learning rate too slow to be useful? (You'll find a discussion just below the exercise.)




Reducing Loss: Check Your Understanding

Check Your Understanding: Batch Size

Explore the options below.

When performing gradient descent on a large data set, which of the following batch sizes will likely be more efficient?
  • The full batch.
    Incorrect. Computing the gradient from a full batch is inefficient. That is, the gradient can usually be computed far more efficiently (and just as accurately) from a smaller batch than from a vastly bigger full batch.
  • A small batch or even a batch of one example (SGD).
    Correct. Amazingly enough, performing gradient descent on a small batch or even a batch of one example is usually more efficient than the full batch. After all, finding the gradient of one example is far cheaper than finding the gradient of millions of examples. To ensure a good representative sample, the algorithm scoops up another random small batch (or batch of one) on every iteration.

 


First Steps with TensorFlow

TensorFlow API Hierarchy

(Diagram: hierarchy of TensorFlow toolkits, from top to bottom: Estimators (high-level, object-oriented API); tf.layers, tf.losses, tf.metrics (reusable libraries for common model components); Python TensorFlow (provides Ops, which wrap C++ kernels); C++ TensorFlow (kernels work on one or more platforms); CPU, GPU, and TPU.)

A Quick Look at the tf.estimator API

import tensorflow as tf
# Set up a linear classifier.
classifier = tf.estimator.LinearClassifier(feature_columns)
# Train the model on some example data.
classifier.train(input_fn=train_input_fn, steps=2000)
# Use it to predict.
predictions = classifier.predict(input_fn=predict_input_fn)


First Steps with TensorFlow: Toolkit

TensorFlow is a computational framework for building machine learning models. TensorFlow provides a variety of different toolkits that allow you to construct models at your preferred level of abstraction. You can use lower-level APIs to build models by defining a series of mathematical operations. Alternatively, you can use higher-level APIs (like tf.estimator) to specify predefined architectures, such as linear regressors or neural networks.

The following figure shows the current hierarchy of TensorFlow toolkits:

(Diagram: hierarchy of TensorFlow toolkits, from top to bottom: Estimators (high-level, object-oriented API); tf.layers, tf.losses, tf.metrics (reusable libraries for common model components); Python TensorFlow (provides Ops, which wrap C++ kernels); C++ TensorFlow (kernels work on one or more platforms); CPU, GPU, and TPU.)

Figure 1. TensorFlow toolkit hierarchy.

The following table summarizes the purposes of the different layers:

Toolkit(s)                     | Description
Estimator (tf.estimator)       | High-level, OOP API.
tf.layers/tf.losses/tf.metrics | Libraries for common model components.
TensorFlow                     | Lower-level APIs.

TensorFlow consists of the following two components:

  • a graph protocol buffer
  • a runtime that executes the (distributed) graph

These two components are analogous to Python code and the Python interpreter. Just as the Python interpreter is implemented on multiple hardware platforms to run Python code, TensorFlow can run the graph on multiple hardware platforms, including CPU, GPU, and TPU.

Which API(s) should you use? You should use the highest level of abstraction that solves the problem. The higher levels of abstraction are easier to use, but are also (by design) less flexible. We recommend you start with the highest-level API first and get everything working. If you need additional flexibility for some special modeling concerns, move one level lower. Note that each level is built using the APIs in lower levels, so dropping down the hierarchy should be reasonably straightforward.

tf.estimator API

We'll use tf.estimator for the majority of exercises in Machine Learning Crash Course. Everything you'll do in the exercises could have been done in lower-level (raw) TensorFlow, but using tf.estimator dramatically lowers the number of lines of code.

tf.estimator is compatible with the scikit-learn API. Scikit-learn is an extremely popular open-source ML library in Python, with over 100k users, including many at Google.

Very broadly speaking, here's the pseudocode for a linear classification program implemented in tf.estimator:

import tensorflow as tf
# Set up a linear classifier.
classifier = tf.estimator.LinearClassifier(feature_columns)
# Train the model on some example data.
classifier.train(input_fn=train_input_fn, steps=2000)
# Use it to predict.
predictions = classifier.predict(input_fn=predict_input_fn)

First Steps with TensorFlow: Programming Exercises

As you progress through Machine Learning Crash Course, you'll put the principles and techniques you learn into practice by coding models using tf.estimator, a high-level TensorFlow API.

The programming exercises in Machine Learning Crash Course use a data-analysis platform that combines code, output, and descriptive text into one collaborative document.

Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

Run the following three exercises in the provided order:

  1. Quick Introduction to pandas. pandas is an important library for data analysis and modeling, and is widely used in TensorFlow coding. This tutorial provides all the pandas information you need for this course. If you already know pandas, you can skip this exercise.

  2. First Steps with TensorFlow. This exercise explores linear regression.

  3. Synthetic Features and Outliers. This exercise explores synthetic features and the effect of input outliers.

Common hyperparameters in Machine Learning Crash Course exercises

Many of the coding exercises contain the following hyperparameters:

  • steps, which is the total number of training iterations. One step calculates the loss from one batch and uses that value to modify the model's weights once.
  • batch size, which is the number of examples (chosen at random) for a single step. For example, the batch size for SGD is 1.

The following formula applies:

$$ total\,number\,of\,trained\,examples = batch\,size * steps $$

A convenience variable in Machine Learning Crash Course exercises

The following convenience variable appears in several exercises:

  • periods, which controls the granularity of reporting. For example, if periods is set to 7 and steps is set to 70, then the exercise will output the loss value every 10 steps (or 7 times). Unlike hyperparameters, we don't expect you to modify the value of periods. Note that modifying periods does not alter what your model learns.

The following formula applies:

$$ number\,of\,training\,examples\,in\,each\,period = \frac{batch\,size * steps} {periods} $$
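
For example (a quick sketch in plain Python with made-up values), the two formulas work out as follows:

batch_size = 100
steps = 500
periods = 10

total_trained_examples = batch_size * steps            # 50,000
examples_per_period = (batch_size * steps) // periods  # 5,000
print(total_trained_examples, examples_per_period)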

 


Generalization

Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.

The Big Picture

(Diagram: a cycle of sampling i.i.d. from a hidden true distribution to obtain an empirical data sample, learning a model from that sample, then predicting on a new sample drawn i.i.d. from the same distribution.)
  • Goal: predict well on new data drawn from (hidden) true distribution.
  • Problem: we don't see the truth.
    • We only get to sample from it.
  • If model h fits our current sample well, how can we trust it will predict well on other new samples?

How Do We Know If Our Model Is Good?

  • Theoretically:
    • Interesting field: generalization theory
    • Based on ideas of measuring model simplicity / complexity
  • Intuition: formalization of Occam's Razor principle
    • The less complex a model is, the more likely that a good empirical result is not just due to the peculiarities of our sample

How Do We Know If Our Model Is Good?

  • Empirically:
    • Asking: will our model do well on a new sample of data?
    • Evaluate: get a new sample of data; call it the test set
    • Good performance on the test set is a useful indicator of good performance on the new data in general:
      • If the test set is large enough
      • If we don't cheat by using the test set over and over

The ML Fine Print

Three basic assumptions in all of the above:

  1. We draw examples independently and identically distributed (i.i.d.) at random from the distribution
  2. The distribution is stationary: It doesn't change over time
  3. We always pull from the same distribution: Including training, validation, and test sets

Training and Test Sets

A test set is a data set used to evaluate the model developed from a training set.

Partitioning Data Sets

(Diagram: a horizontal bar divided into two pieces: 80% of which is the training set and 20% the test set.)

Train Evaluation vs Test Evaluation

(Figure: two models, one run on training data and the other on test data. The model is very simple, just a line dividing the orange dots from the blue dots. The loss on the training data is similar to the loss on the test data.)

What If We Only Have One Data Set?

  • Divide into two sets:
    • training set
    • test set
  • Classic gotcha: do not train on test data
    • Getting surprisingly low loss?
    • Before celebrating, check if you're accidentally training on test data

Training and Test Sets: Splitting Data

The previous module introduced the idea of dividing your data set into two subsets:

  • training set: a subset to train a model
  • test set: a subset to test the trained model

You could imagine slicing the single data set as follows:

(Diagram: a horizontal bar divided into two pieces: 80% of which is the training set and 20% the test set.)

Figure 1. Slicing a single data set into a training set and test set.

Make sure that your test set meets the following two conditions:

  • It is large enough to yield statistically meaningful results.
  • It is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.

Assuming that your test set meets the preceding two conditions, your goal is to create a model that generalizes well to new data. Our test set serves as a proxy for new data. For example, consider the following figure. Notice that the model learned for the training data is very simple. This model doesn't do a perfect job—a few predictions are wrong. However, this model does about as well on the test data as it does on the training data. In other words, this simple model does not overfit the training data.

(Figure: two models, one run on training data and the other on test data. The model is very simple, just a line dividing the orange dots from the blue dots. The loss on the training data is similar to the loss on the test data.)

Figure 2. Validating the trained model against test data.

Never train on test data. If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. For example, high accuracy might indicate that test data has leaked into the training set.

For example, consider a model that predicts whether an email is spam, using the subject line, email body, and sender's email address as features. We apportion the data into training and test sets, with an 80-20 split. After training, the model achieves 99% precision on both the training set and the test set. We'd expect a lower precision on the test set, so we take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set (we neglected to scrub duplicate entries for the same spam email from our input database before splitting the data). We've inadvertently trained on some of our test data, and as a result, we're no longer accurately measuring how well our model generalizes to new data.


Training and Test Sets: Playground Exercise

Training Sets and Test Sets

We return to Playground to experiment with training sets and test sets.

This exercise provides both a test set and a training set, both drawn from the same data set. By default, the visualization shows only the training set. If you'd like to also see the test set, click the Show test data checkbox just below the visualization. In the visualization, note the following distinction:

Task 1: Run Playground with the given settings by doing the following:

  1. Click the Run/Pause button.
  2. Watch the Test loss and Training loss values change.
  3. When the Test loss and Training loss values stop changing or only change once in a while, press the Run/Pause button again to pause Playground.
Note the delta between the Test loss and Training loss. We'll try to reduce this delta in the following tasks.

Task 2: Do the following:

  1. Press the Reset button.
  2. Modify the Learning rate.
  3. Press the Run/Pause button.
  4. Let Playground run for at least 150 epochs.

Is the delta between Test loss and Training loss lower or higher with this new Learning rate? What happens if you modify both Learning rate and batch size?

Optional Task 3: A slider labeled Ratio of training to test data lets you control the proportion of test data to training data. For example, when set to 90%, the training set contains many more examples than the test set. When set to 10%, the training set contains far fewer examples than the test set.

Do the following:

  1. Reduce the "Ratio of training data to test data" from 50% to 10%.
  2. Experiment with Learning rate and Batch size, taking notes on your findings.
Does altering the Ratio of training data to test data change the optimal learning settings that you discovered in Task 2? If so, why?


Validation: Check Your Intuition

Before beginning this module, consider whether there are any pitfalls in using the training process outlined in Training and Test Sets.

Explore the options below.

We looked at a process of using a test set and a training set to drive iterations of model development. On each iteration, we'd train on the training data and evaluate on the test data, using the evaluation results on test data to guide choices of and changes to various model hyperparameters like learning rate and features. Is there anything wrong with this approach? (Pick only one answer.)
  • Totally fine, we're training on training data and evaluating on separate, held-out test data.
    Incorrect. Actually, there's a subtle issue here. Think about what might happen if we did many, many iterations of this form.
  • Doing many rounds of this procedure might cause us to implicitly fit to the peculiarities of our specific test set.
    Correct. The more often we evaluate on a given test set, the more we are at risk for implicitly overfitting to that one test set. We'll look at a better protocol next.
  • This is computationally inefficient. We should just pick a default set of hyperparameters and live with them to save resources.
    Incorrect. Although these sorts of iterations are expensive, they are a critical part of model development. Hyperparameter settings can make an enormous difference in model quality, and we should always budget some amount of time and computational resources to ensure we're getting the best quality we can.

Validation

Partitioning a data set into a training set and test set lets you judge whether a given model will generalize well to new data. However, using only two partitions may be insufficient when doing many rounds of hyperparameter tuning.

A Possible Workflow?

(Diagram: a workflow with three stages: 1. Train model on training set. 2. Evaluate model on test set. 3. Tweak model according to results on test set. Iterate on 1, 2, and 3, ultimately picking the model that does best on the test set.)

Partitioning Data Sets

(Diagram: a horizontal bar divided into three pieces: 70% of which is the training set, 15% the validation set, and 15% the test set.)

Better Workflow: Use a Validation Set

(Diagram: a workflow similar to the one above, except that instead of evaluating the model against the test set, the workflow evaluates and tweaks the model against the validation set. Once the training set and validation set more-or-less agree, confirm the model against the test set.)

Validation: Another Partition

The previous module introduced partitioning a data set into a training set and a test set. This partitioning enabled you to train on one set of examples and then to test the model against a different set of examples. With two partitions, the workflow could look as follows:

(Diagram: a workflow with three stages: 1. Train model on training set. 2. Evaluate model on test set. 3. Tweak model according to results on test set. Iterate on 1, 2, and 3, ultimately picking the model that does best on the test set.)

Figure 1. A possible workflow?

In the figure, "Tweak model" means adjusting anything about the model you can dream up—from changing the learning rate, to adding or removing features, to designing a completely new model from scratch. At the end of this workflow, you pick the model that does best on the test set.

Dividing the data set into two sets is a good idea, but not a panacea. You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure:

(Diagram: a horizontal bar divided into three pieces: 70% of which is the training set, 15% the validation set, and 15% the test set.)

Figure 2. Slicing a single data set into three subsets.

Use the validation set to evaluate results from the training set. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. The following figure shows this new workflow:

(Diagram: a workflow similar to Figure 1, except that instead of evaluating the model against the test set, the workflow evaluates and tweaks the model against the validation set. Once the training set and validation set more-or-less agree, confirm the model against the test set.)

Figure 3. A better workflow.

In this improved workflow:

  1. Pick the model that does best on the validation set.
  2. Double-check that model against the test set.

This is a better workflow because it creates fewer exposures to the test set.
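
As a minimal sketch of this three-way partition (plain Python with NumPy; the 1,000 stand-in examples and the 70/15/15 split mirror Figure 2):

import numpy as np

rng = np.random.default_rng(17)
examples = np.arange(1000)  # stand-in for your labeled examples
rng.shuffle(examples)       # shuffle before splitting

n = len(examples)
train = examples[: int(0.70 * n)]                     # 70%: training set
validation = examples[int(0.70 * n) : int(0.85 * n)]  # 15%: validation set
test = examples[int(0.85 * n) :]                      # 15%: test set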


Validation: Programming Exercise

The following exercise dives more deeply into training and evaluating a model:

Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Validation programming exercise

    Representation

    A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

    From Raw Data to Features

    The idea is to map each part of the raw data on the left into one or more fields in the feature vector on the right.

    (Diagram: raw data, e.g. {num_rooms: 6, num_bedrooms: 3, street_name: "Main Street", num_basement_rooms: -1, ...}, is mapped through feature engineering to a feature vector such as [6.0, 1.0, 0.0, 0.0, 0.0, 9.321, -2.20, 1.01, 0.0, ...]. Raw data doesn't come to us as feature vectors; the process of creating features from raw data is feature engineering.)

    From Raw Data to Features

    (Diagram: the raw value num_rooms: 6 is copied directly to the feature num_rooms_feature = [6.0]. Real-valued features can be copied over directly.)

    From Raw Data to Features

    (Diagram: the string value street_name: "Main Street" is mapped via one-hot encoding to the sparse vector street_name feature = [0, 0, ..., 0, 1, 0, ..., 0] of length V, the number of unique vocabulary items (streets). The vector has a 1 for "Main Street" and 0 for all others.)
    • Dictionary maps each street name to an int in {0, ..., V-1}
    • Now represent the one-hot vector above sparsely as just the index i of its single 1

    Properties of a Good Feature

    Feature values should appear with non-zero value more than a small handful of times in the dataset.

    my_device_id:8SK982ZZ1242Z (bad: each value appears only once)

    device_model:galaxy_s6 (good: appears many times)

    Properties of a Good Feature

    Features should have a clear, obvious meaning.

    user_age:23 (good: clear, obvious meaning)

    user_age:123456789 (bad: meaning is indecipherable)

    Properties of a Good Feature

    Features shouldn't take on "magic" values

    (use an additional boolean feature like is_watch_time_defined instead!)

    watch_time: -1.0 (bad: -1 is a magic value meaning "undefined")

    watch_time: 1.023 (good)

    watch_time_is_defined: 1.0 (good: a separate boolean feature)

    Properties of a Good Feature

    The definition of a feature shouldn't change over time.

    (Beware of depending on other ML systems!)

    city_id:"br/sao_paulo"

    inferred_city_cluster_id:219

    Properties of a Good Feature

    Distribution should not have crazy outliers

    Ideally all features transformed to a similar range, like (-1, 1) or (0, 5).

    (Figures: a distribution of roomsPerPerson with an extreme outlier at 50 rooms per person, and the same feature capped to a max of 4.0.)

    The Binning Trick

    (Figure: a distribution of values plotted against latitude, with a fitted curve.)

    The Binning Trick

    (Figure: the same latitude distribution divided into boolean bins, e.g. LatitudeBin1 = 32 < latitude <= 33 through LatitudeBin6 = 37 < latitude <= 38.)
    • Create several boolean bins, each mapping to a new unique feature
    • Allows model to fit a different value for each bin
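
    A minimal sketch of the binning trick (plain Python; the bin boundaries follow the figure above):

    # Map a latitude to boolean bins, one new feature per bin:
    # LatitudeBin1 = 32 < latitude <= 33, ..., LatitudeBin6 = 37 < latitude <= 38.
    def latitude_to_bins(latitude, low=32, high=38):
        bins = [0.0] * (high - low)
        for i in range(high - low):
            if low + i < latitude <= low + i + 1:
                bins[i] = 1.0  # the model can learn a separate weight per bin
        return bins

    print(latitude_to_bins(35.7))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]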

    Good Habits

    KNOW YOUR DATA

    • Visualize: Plot histograms, rank most to least common.
    • Debug: Duplicate examples? Missing values? Outliers? Data agrees with dashboards? Training and Validation data similar?
    • Monitor: Feature quantiles, number of examples over time?

    Representation: Feature Engineering

    In traditional programming, the focus is on code. In machine learning projects, the focus shifts to representation. That is, one way developers hone a model is by adding and improving its features.

    Mapping Raw Data to Features

    The left side of Figure 1 illustrates raw data from an input data source; the right side illustrates a feature vector, which is the set of floating-point values comprising the examples in your data set. Feature engineering means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.

    Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.

    (Diagram: raw data, e.g. {num_rooms: 6, num_bedrooms: 3, street_name: "Shorebird Way", num_basement_rooms: -1, ...}, is mapped through feature engineering to a feature vector such as [6.0, 1.0, 0.0, 0.0, 0.0, 9.321, -2.20, 1.01, 0.0, ...].)

    Figure 1. Feature engineering maps raw data to ML features.

    Mapping numeric values

    Integer and floating-point data don't need a special encoding because they can be multiplied by a numeric weight. As suggested in Figure 2, converting the raw integer value 6 to the feature value 6.0 is trivial:

    (Diagram: the raw integer value num_rooms: 6 is copied directly to the feature num_rooms_feature = [6.0].)

    Figure 2. Mapping integer values to floating-point values.

    Mapping categorical values

    Categorical features have a discrete set of possible values. For example, there might be a feature called street_name with options that include:

    {'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}
    

    Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.

    We can accomplish this by defining a mapping from the feature values, which we'll refer to as the vocabulary of possible values, to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all "other" category, known as an OOV (out-of-vocabulary) bucket.

    Using this approach, here's how we can map our street names to numbers:

    • map Charleston Road to 0
    • map North Shoreline Boulevard to 1
    • map Shorebird Way to 2
    • map Rengstorff Avenue to 3
    • map everything else (OOV) to 4

    However, if we incorporate these index numbers directly into our model, it will impose some constraints that might be problematic:

    • We'll be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, then we would multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way, and so on. It's unlikely that house prices vary linearly with this arbitrary ordering; our model needs the flexibility of learning a separate weight for each street.
    • We aren't accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there's no way to encode that information in the street_name value if it contains a single index.

    To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:

    • For values that apply to the example, set the corresponding vector elements to 1.
    • Set all other elements to 0.

    The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.

    Figure 3 illustrates a one-hot encoding of a particular street: Shorebird Way. The element in the binary vector for Shorebird Way has a value of 1, while the elements for all other streets have values of 0.

    (Diagram: the string value street_name: "Shorebird Way" is mapped via one-hot encoding to the sparse vector street_name feature = [0, 0, ..., 0, 1, 0, ..., 0] of length V, the number of unique vocabulary items. The vector has a 1 for "Shorebird Way" and 0 for all others.)

    Figure 3. Mapping street address via one-hot encoding.

    This approach effectively creates a Boolean variable for every feature value (e.g., street name). Here, if a house is on Shorebird Way then the binary value is 1 only for Shorebird Way. Thus, the model uses only the weight for Shorebird Way.

    Similarly, if a house is at the corner of two streets, then two binary values are set to 1, and the model uses both their respective weights.
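
    A minimal sketch of this encoding (plain Python; the vocabulary and the OOV bucket follow the street_name example above):

    vocab = {'Charleston Road': 0, 'North Shoreline Boulevard': 1,
             'Shorebird Way': 2, 'Rengstorff Avenue': 3}
    OOV_INDEX = len(vocab)  # catch-all "other" bucket

    def encode_streets(street_names):
        # One-hot for a single street; multi-hot if the example has several.
        vector = [0.0] * (len(vocab) + 1)
        for name in street_names:
            vector[vocab.get(name, OOV_INDEX)] = 1.0
        return vector

    print(encode_streets(['Shorebird Way']))  # [0.0, 0.0, 1.0, 0.0, 0.0]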

    Sparse Representation

    Suppose that you had 1,000,000 different street names in your data set that you wanted to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements where only 1 or 2 elements are true is a very inefficient representation in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a sparse representation in which only nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.


    Representation: Qualities of Good Features

    We've explored ways to map raw data into suitable feature vectors, but that's only part of the work. We must now explore what kinds of values actually make good features within those feature vectors.

    Avoid rarely used discrete feature values

    Good feature values should appear more than 5 or so times in a data set. Doing so enables a model to learn how this feature value relates to the label. That is, having many examples with the same discrete value gives the model a chance to see the feature in different settings, and in turn, determine when it's a good predictor for the label. For example, a house_type feature would likely contain many examples in which its value was victorian:

    house_type: victorian
    

    Conversely, if a feature's value appears only once or very rarely, the model can't make predictions based on that feature. For example, unique_house_id is a bad feature because each value would be used only once, so the model couldn't learn anything from it:

    unique_house_id: 8SK982ZZ1242Z
    

    Prefer clear and obvious meanings

    Each feature should have a clear and obvious meaning to anyone on the project. For example, consider the following good feature for a house's age, which is instantly recognizable as the age in years:

    house_age: 27
    

    Conversely, the meaning of the following feature value is pretty much indecipherable to anyone but the engineer who created it:

    house_age: 851472000
    

    In some cases, noisy data (rather than bad engineering choices) causes unclear values. For example, the following user_age came from a source that didn't check for appropriate values:

    user_age: 277
    

    Don't mix "magic" values with actual data

    Good floating-point features don't contain peculiar out-of-range discontinuities or "magic" values. For example, suppose a feature holds a floating-point value between 0 and 1. So, values like the following are fine:

    quality_rating: 0.82
    quality_rating: 0.37
    

    However, if a user didn't enter a quality_rating, perhaps the data set represented its absence with a magic value like the following:

    quality_rating: -1
    

    To work around magic values, convert the feature into two features:

    • One feature holds only quality ratings, never magic values.
    • One feature holds a boolean value indicating whether or not a quality_rating was supplied. Give this boolean feature a name like is_quality_rating_defined.
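
    A minimal sketch of that conversion in Python, assuming -1 is the magic value and using hypothetical feature names:

      def split_magic_value(quality_rating):
          """Split a rating that may hold the magic value -1 into two features."""
          is_defined = quality_rating != -1               # Boolean indicator feature.
          rating = quality_rating if is_defined else 0.0  # Rating without magic values.
          return rating, is_defined

      print(split_magic_value(0.82))  # (0.82, True)
      print(split_magic_value(-1))    # (0.0, False)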

    Account for upstream instability

    The definition of a feature shouldn't change over time. For example, the following value is useful because the city name probably won't change. (Note that we'll still need to convert a string like "br/sao_paulo" to a one-hot vector.)

    city_id: "br/sao_paulo"
    

    But gathering a value inferred by another model carries additional costs. Perhaps the value "219" currently represents Sao Paulo, but that representation could easily change on a future run of the other model:

    inferred_city_cluster: "219"
    

    Representation: Cleaning Data

    Apple trees produce some mixture of great fruit and wormy messes. Yet the apples in high-end grocery stores display 100% perfect fruit. Between orchard and grocery, someone spends significant time removing the bad apples or throwing a little wax on the salvageable ones. As an ML engineer, you'll spend enormous amounts of your time tossing out bad examples and cleaning up the salvageable ones. Even a few "bad apples" can spoil a large data set.

    Scaling feature values

    Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:

    • Helps gradient descent converge more quickly.
    • Helps avoid the "NaN trap," in which one number in the model becomes a NaN (for example, when a value exceeds the floating-point precision limit during training) and, due to math operations, every other number in the model eventually becomes a NaN.
    • Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.

    You don't have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
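
    As an illustration, here are two common scaling recipes sketched in plain Python; the function names are our own, and linear scaling assumes you know (or compute) the natural minimum and maximum:

      def linear_scale(values, new_min=0.0, new_max=1.0):
          """Map values from their natural range into [new_min, new_max]."""
          lo, hi = min(values), max(values)
          return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

      def z_score(values):
          """Scale values to mean 0 and standard deviation 1."""
          mean = sum(values) / len(values)
          std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
          return [(v - mean) / std for v in values]

      print(linear_scale([100, 500, 900]))  # [0.0, 0.5, 1.0]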

    Handling extreme outliers

    The following plot represents a feature called roomsPerPerson from the California Housing data set. The value of roomsPerPerson was calculated by dividing the total number of rooms for an area by the population for that area. The plot shows that the vast majority of areas in California have one or two rooms per person. But take a look along the x-axis.

    A plot of roomsPerPerson in which nearly all the values are clustered between 0 and 4, but there's a verrrrry long tail reaching all the way out to 55 rooms per person.

    Figure 4. A verrrrry lonnnnnnng tail.

    How could we minimize the influence of those extreme outliers? Well, one way would be to take the log of every value:

    A plot of log(roomsPerPerson) in which 99% of values cluster between about 0.4 and 1.8, but there's still a longish tail that goes out to 4.2 or so.

    Figure 5. Logarithmic scaling still leaves a tail.

    Log scaling does a slightly better job, but there's still a significant tail of outlier values. Let's pick yet another approach. What if we simply "cap" or "clip" the maximum value of roomsPerPerson at an arbitrary value, say 4.0?

    A plot of roomsPerPerson in which all values lie between -0.3 and 4.0. The plot is bell-shaped, but there's an anomalous hill at 4.0.

    Figure 6. Clipping feature values at 4.0

    Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
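
    Here's a quick sketch of both approaches in Python (math.log assumes strictly positive values):

      import math

      def log_scale(values):
          """Compress a long tail by taking the log of every value."""
          return [math.log(v) for v in values]

      def clip(values, cap=4.0):
          """Clip values at cap: anything greater than cap becomes cap."""
          return [min(v, cap) for v in values]

      rooms_per_person = [1.2, 2.0, 3.1, 55.0]
      print(clip(rooms_per_person))                    # [1.2, 2.0, 3.1, 4.0]
      print([round(v, 2) for v in log_scale([55.0])])  # [4.01]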

    Binning

    The following plot shows the relative prevalence of houses at different latitudes in California. Notice the clustering—Los Angeles is about at latitude 34 and San Francisco is roughly at latitude 38.

    A plot of houses per latitude. The plot is highly irregular, containing doldrums around latitude 36 and huge spikes around latitudes 34 and 38.

    Figure 7. Houses per latitude.

    In the data set, latitude is a floating-point value. However, it doesn't make sense to represent latitude as a floating-point feature in our model. That's because no linear relationship exists between latitude and housing values. For example, houses at latitude 35 are not 35/34 more expensive (or less expensive) than houses at latitude 34. And yet, individual latitudes probably are a pretty good predictor of house values.

    To make latitude a helpful predictor, let's divide latitudes into "bins" as suggested by the following figure:

    A plot of houses per latitude, divided into "bins" between whole-number latitudes (for example, LatitudeBin1 = 32 < latitude <= 33 and LatitudeBin6 = 37 < latitude <= 38).

    Figure 8. Binning values.

    Instead of having one floating-point feature, we now have 11 distinct boolean features (LatitudeBin1, LatitudeBin2, ..., LatitudeBin11). Having 11 separate features is somewhat inelegant, so let's unite them into a single 11-element vector. Doing so will enable us to represent latitude 37.4 as follows:

    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    

    Thanks to binning, our model can now learn completely different weights for each latitude.
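
    A minimal sketch of that binning in Python, assuming eleven whole-degree bins starting at latitude 32 (bins here are closed on the left for simplicity):

      def bin_latitude(lat, low=32, num_bins=11):
          """One-hot encode a latitude into whole-degree bins."""
          vec = [0] * num_bins
          vec[int(lat) - low] = 1
          return vec

      print(bin_latitude(37.4))  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]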

    Scrubbing

    Until now, we've assumed that all the data used for training and testing was trustworthy. In real life, many examples in data sets are unreliable due to one or more of the following:

    • Omitted values. For instance, a person forgot to enter a value for a house's age.
    • Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
    • Bad labels. For instance, a person mislabeled a picture of an oak tree as a maple.
    • Bad feature values. For example, someone typed in an extra digit, or a thermometer was left out in the sun.

    Once detected, you typically "fix" bad examples by removing them from the data set. To detect omitted values or duplicated examples, you can write a simple program. Detecting bad feature values or labels can be far trickier.

    In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate. In addition, getting statistics like the following can help:

    • Maximum and minimum
    • Mean and median
    • Standard deviation

    Consider generating lists of the most common values for discrete features. For example, does the number of examples with country:uk match the number you expect? Should language:jp really be the most common language in your data set?

    Know your data

    Follow these rules:

    • Keep in mind what you think your data should look like.
    • Verify that the data meets these expectations (or that you can explain why it doesn't).
    • Double-check that the training data agrees with other sources (for example, dashboards).

    Treat your data with all the care that you would treat any mission-critical code. Good ML relies on good data.

    Additional Information

    Rules of Machine Learning, ML Phase II: Feature Engineering


    Representation: Programming Exercise

    In this programming exercise, you'll create a good, minimal set of features:

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Feature Sets programming exercise

    Feature Crosses

    A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.

    Feature Crosses

    • Feature crosses is the name of this approach
    • Define templates of the form [A x B]
    • Can be complex: [A x B x C x D x E]
    • When A and B represent boolean features, such as bins, the resulting crosses can be extremely sparse

    Feature Crosses: Some Examples

    • Housing market price predictor:

      [latitude X num_bedrooms]

    Feature Crosses: Some Examples

    • Housing market price predictor:

      [latitude X num_bedrooms]

    • Tic-Tac-Toe predictor:

      [pos1 x pos2 x ... x pos9]

    Feature Crosses: Why would we do this?

    • Linear learners use linear models
    • Such learners scale well to massive data, e.g., Vowpal Wabbit, sofia-ml
    • But without feature crosses, the expressivity of these models would be limited
    • Using feature crosses + massive data is one efficient strategy for learning highly complex models
      • Foreshadowing: neural nets provide another

    Feature Crosses: Encoding Nonlinearity

    In Figures 1 and 2, imagine the following:

    • The blue dots represent sick trees.
    • The orange dots represent healthy trees.

    Blue dots occupy the northeast quadrant; orange dots occupy the southwest quadrant.

    Figure 1. Is this a linear problem?

    Can you draw a line that neatly separates the sick trees from the healthy trees? Sure. This is a linear problem. The line won't be perfect. A sick tree or two might be on the "healthy" side, but your line will be a good predictor.

    Now look at the following figure:

    Blue dots occupy the northeast and southwest quadrants; orange dots occupy the northwest and southeast quadrants.

    Figure 2. Is this a linear problem?

    Can you draw a single straight line that neatly separates the sick trees from the healthy trees? No, you can't. This is a nonlinear problem. Any line you draw will be a poor predictor of tree health.

    Same drawing as Figure 2, except that a horizontal line breaks the plane. Blue and orange dots are above the line; blue and orange dots are below the line.

    Figure 3. A single line can't separate the two classes.

     

    To solve the nonlinear problem shown in Figure 2, create a feature cross. A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together. (The term cross comes from cross product.) Let's create a feature cross named \(x_3\) by crossing \(x_1\) and \(x_2\):

    $$x_3 = x_1x_2$$

    We treat this newly minted \(x_3\) feature cross just like any other feature. The linear formula becomes:

    $$y = b + w_1x_1 + w_2x_2 + w_3x_3$$

    A linear algorithm can learn a weight for \(w_3\) just as it would for \(w_1\) and \(w_2\). In other words, although \(w_3\) encodes nonlinear information, you don’t need to change how the linear model trains to determine the value of \(w_3\).
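
    As a small illustration, with made-up points drawn from the four quadrants of Figure 2, a single positive weight on the crossed feature separates the two classes:

      def add_cross(examples):
          """Append the synthetic feature x3 = x1 * x2 to each example."""
          return [(x1, x2, x1 * x2) for (x1, x2) in examples]

      def predict(weights, bias, example):
          """The linear formula: y = b + w1*x1 + w2*x2 + w3*x3."""
          return bias + sum(w * x for w, x in zip(weights, example))

      # One point from each quadrant: NE, NW, SW, SE.
      crossed = add_cross([(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)])
      # With w1 = w2 = 0 and w3 = 1, the sign of the prediction separates
      # the NE/SW points (positive) from the NW/SE points (negative).
      print([predict([0.0, 0.0, 1.0], 0.0, ex) for ex in crossed])  # [1.0, -1.0, 1.0, -1.0]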

    Kinds of feature crosses

    We can create many different kinds of feature crosses. For example:

    • [A X B]: a feature cross formed by multiplying the values of two features.
    • [A x B x C x D x E]: a feature cross formed by multiplying the values of five features.
    • [A x A]: a feature cross formed by squaring a single feature.

    Thanks to stochastic gradient descent, linear models can be trained efficiently. Consequently, supplementing scaled linear models with feature crosses has traditionally been an efficient way to train on massive-scale data sets.


    Feature Crosses: Crossing One-Hot Vectors

    So far, we've focused on feature-crossing two individual floating-point features. In practice, machine learning models seldom cross continuous features. However, machine learning models do frequently cross one-hot feature vectors. Think of feature crosses of one-hot feature vectors as logical conjunctions. For example, suppose we have two features: country and language. A one-hot encoding of each generates vectors with binary features that can be interpreted as country=USA, country=France or language=English, language=Spanish. Then, if you do a feature cross of these one-hot encodings, you get binary features that can be interpreted as logical conjunctions, such as:

      country:usa AND language:spanish
    

    As another example, suppose you bin latitude and longitude, producing separate one-hot five-element feature vectors. For instance, a given latitude and longitude could be represented as follows:

      binned_latitude = [0, 0, 0, 1, 0]
      binned_longitude = [0, 1, 0, 0, 0]
    

    Suppose you create a feature cross of these two feature vectors:

      binned_latitude X binned_longitude
    

    This feature cross is a 25-element one-hot vector (24 zeroes and 1 one). The single 1 in the cross identifies a particular conjunction of latitude and longitude. Your model can then learn particular associations about that conjunction.
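
    As a sketch, the cross can be computed as the outer product of the two one-hot vectors, flattened into one long vector:

      def cross_one_hot(a, b):
          """Cross two one-hot vectors into one vector of length len(a) * len(b)."""
          return [ai * bj for ai in a for bj in b]

      binned_latitude = [0, 0, 0, 1, 0]
      binned_longitude = [0, 1, 0, 0, 0]
      crossed = cross_one_hot(binned_latitude, binned_longitude)
      print(len(crossed), sum(crossed))  # 25 1 -- a 25-element one-hot vector
      print(crossed.index(1))            # 16 -- identifies one lat/lon conjunction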

    Suppose we bin latitude and longitude much more coarsely, as follows:

    binned_latitude(lat) = [
      0  < lat <= 10
      10 < lat <= 20
      20 < lat <= 30
    ]
    binned_longitude(lon) = [
      0  < lon <= 15
      15 < lon <= 30
    ]
    

    Creating a feature cross of those coarse bins leads to a synthetic feature having the following meanings:

    binned_latitude_X_longitude(lat, lon) = [
      0  < lat <= 10 AND 0  < lon <= 15
      0  < lat <= 10 AND 15 < lon <= 30
      10 < lat <= 20 AND 0  < lon <= 15
      10 < lat <= 20 AND 15 < lon <= 30
      20 < lat <= 30 AND 0  < lon <= 15
      20 < lat <= 30 AND 15 < lon <= 30
    ]
    

    Now suppose our model needs to predict how satisfied dog owners will be with dogs based on two features:

    • Behavior type (barking, crying, snuggling, etc.)
    • Time of day

    If we build a feature cross from both these features:

      [behavior type X time of day]
    

    then we'll end up with vastly more predictive ability than either feature on its own. For example, a dog crying (happily) at 5:00 pm when the owner returns from work will likely be a great positive predictor of owner satisfaction. Crying (miserably, perhaps) at 3:00 am when the owner was sleeping soundly will likely be a strong negative predictor of owner satisfaction.

    Linear learners scale well to massive data. Using feature crosses on massive data sets is one efficient strategy for learning highly complex models. Neural networks provide another strategy.


    Feature Crosses: Playground Exercises

    Introducing Feature Crosses

    Can a feature cross truly enable a model to fit nonlinear data? To find out, try this exercise.

    Task: Try to create a model that separates the blue dots from the orange dots by manually changing the weights of the following three input features:

    • x1
    • x2
    • x1x2 (the feature cross)

    To manually change a weight:

    1. Click on a line that connects FEATURES to OUTPUT. An input form will appear.
    2. Type a floating-point value into that input form.
    3. Press Enter.

    Note that the interface for this exercise does not contain a Step button. That's because this exercise does not iteratively train a model. Rather, you will manually enter the "final" weights for the model.

    (Answers appear just below the exercise.)




    More Complex Feature Crosses

    Now let's play with some advanced feature cross combinations. The data set in this Playground exercise looks a bit like a noisy bullseye from a game of darts, with the blue dots in the middle and the orange dots in an outer ring.

    Task 1: Run this linear model as given. Spend a minute or two (but no longer) trying different learning rate settings to see if you can find any improvements. Can a linear model produce effective results for this data set?

    Task 2: Now try adding in cross-product features, such as x1x2, trying to optimize performance.

    Task 3: When you have a good model, examine the model output surface (shown by the background color).

    1. Does it look like a linear model?
    2. How would you describe the model?

    (Answers appear just below the exercise.)




    Feature Crosses: Programming Exercise

    In the following exercise, you'll explore feature crosses in TensorFlow:

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Feature crosses programming exercise

    Feature Crosses: Check Your Understanding

    Explore the options below.

    Different cities in California have markedly different housing prices. Suppose you must create a model to predict housing prices. Which of the following sets of features or feature crosses could learn city-specific relationships between roomsPerPerson and housing price?
    Three separate binned features: [binned latitude], [binned longitude], [binned roomsPerPerson]
    Binning is good because it enables the model to learn nonlinear relationships within a single feature. However, a city exists in more than one dimension, so learning city-specific relationships requires crossing latitude and longitude.
    One feature cross: [latitude X longitude X roomsPerPerson]
    In this example, crossing real-valued features is not a good idea. Crossing the real value of, say, latitude with roomsPerPerson enables a 10% change in one feature (say, latitude) to be equivalent to a 10% change in the other feature (say, roomsPerPerson).
    One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]
    Crossing binned latitude with binned longitude enables the model to learn city-specific effects of roomsPerPerson. Binning prevents a change in latitude producing the same result as a change in longitude. Depending on the granularity of the bins, this feature cross could learn city-specific or neighborhood-specific or even block-specific effects.
    Two feature crosses: [binned latitude X binned roomsPerPerson] and [binned longitude X binned roomsPerPerson]
    Binning is a good idea; however, a city is the conjunction of latitude and longitude, so separate feature crosses prevent the model from learning city-specific prices.

    Regularization for Simplicity: Playground Exercise

    Overcrossing?

    Before you watch the video or read the documentation, please complete this exercise that explores overuse of feature crosses.

    Task 1: Run the model as is, with all of the given cross-product features. Are there any surprises in the way the model fits the data? What is the issue?

    Task 2: Try removing various cross-product features to improve performance (albeit only slightly). Why would removing features improve performance?

    (Answers appear just below the exercise.)




    Regularization for Simplicity

    Regularization means penalizing the complexity of a model to reduce overfitting.

    Generalization Curve

    The loss function for the training set gradually declines. By contrast, the loss function for the validation set declines, but then starts to rise.

    Penalizing Model Complexity

    • We want to avoid model complexity where possible.
    • We can bake this idea into the optimization we do at training time.
    • Empirical Risk Minimization:
      • aims for low training error
      • $$ \text{minimize: } Loss(Data\;|\;Model) $$

    Penalizing Model Complexity

    • We want to avoid model complexity where possible.
    • We can bake this idea into the optimization we do at training time.
    • Structural Risk Minimization:
      • aims for low training error
      • while balancing against complexity
      • $$ \text{minimize: } Loss(Data\;|\;Model) + complexity(Model) $$

    Regularization

    • How to define complexity(Model)?

    Regularization

    • How to define complexity(Model)?
    • Prefer smaller weights

    Regularization

    • How to define complexity(Model)?
    • Prefer smaller weights
    • Diverging from this should incur a cost
    • Can encode this idea via L2 regularization (a.k.a. ridge)
      • complexity(model) = sum of the squares of the weights
      • Penalizes really big weights
      • For linear models: prefers flatter slopes
      • Bayesian prior:
        • weights should be centered around zero
        • weights should be normally distributed

    A Loss Function with L2 Regularization

    $$ L(\boldsymbol{w}, D) + \lambda \lVert \boldsymbol{w} \rVert_2^2 $$

    \(\text{Where:}\)

    \(L\text{: the loss term, which aims for low training error}\)
    \(\lambda\text{: a scalar value that balances the loss term against the complexity term}\)
    \(\boldsymbol{w}\text{: the model's weight vector}\)
    \(\lVert\boldsymbol{w}\rVert_2^2\text{: the square of the }L_2\text{ norm of }\boldsymbol{w}\text{, which measures complexity}\)


    Regularization for Simplicity: L₂ Regularization

    Consider the following generalization curve, which shows the loss for both the training set and validation set against the number of training iterations.

    The loss function for the training set gradually declines. By contrast, the loss function for the validation set declines, but then starts to rise.

    Figure 1. Loss on training set and validation set.

    Figure 1 shows a model in which training loss gradually decreases, but validation loss eventually goes up. In other words, this generalization curve shows that the model is overfitting to the data in the training set. Channeling our inner Ockham, perhaps we could prevent overfitting by penalizing complex models, a principle called regularization.

    In other words, instead of simply aiming to minimize loss (empirical risk minimization):

    $$\text{minimize(Loss(Data|Model))}$$

    we'll now minimize loss+complexity, which is called structural risk minimization:

    $$\text{minimize(Loss(Data|Model) + complexity(Model))}$$

    Our training optimization algorithm is now a function of two terms: the loss term, which measures how well the model fits the data, and the regularization term, which measures model complexity.

    Machine Learning Crash Course focuses on two common (and somewhat related) ways to think of model complexity:

    • Model complexity as a function of the weights of all the features in the model.
    • Model complexity as a function of the total number of features with nonzero weights. (A later module covers this approach.)

    If model complexity is a function of weights, a feature weight with a high absolute value is more complex than a feature weight with a low absolute value.

    We can quantify complexity using the L2 regularization formula, which defines the regularization term as the sum of the squares of all the feature weights:

    $$L_2\text{ regularization term} = ||\boldsymbol w||_2^2 = {w_1^2 + w_2^2 + ... + w_n^2}$$

    In this formula, weights close to zero have little effect on model complexity, while outlier weights can have a huge impact.

    For example, a linear model with the following weights:

    $$\{w_1 = 0.2, w_2 = 0.5, w_3 = 5, w_4 = 1, w_5 = 0.25, w_6 = 0.75\}$$

    has an L2 regularization term of 26.915:

    $$w_1^2 + w_2^2 + \boldsymbol{w_3^2} + w_4^2 + w_5^2 + w_6^2$$ $$= 0.2^2 + 0.5^2 + \boldsymbol{5^2} + 1^2 + 0.25^2 + 0.75^2$$ $$= 0.04 + 0.25 + \boldsymbol{25} + 1 + 0.0625 + 0.5625$$ $$= 26.915$$

    But \(w_3\) (bolded above), with a squared value of 25, contributes nearly all the complexity. The sum of the squares of all five other weights adds just 1.915 to the L2 regularization term.
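
    You can verify that arithmetic in a line of Python:

      weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]
      print(round(sum(w ** 2 for w in weights), 3))  # 26.915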


    Regularization for Simplicity: Lambda

    Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda (also called the regularization rate). That is, model developers aim to do the following:

    $$\text{minimize(Loss(Data|Model)} + \lambda \text{ complexity(Model))}$$
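
    As a sketch, the lambda-scaled objective looks like the following; the data-loss value here is just a stand-in:

      def regularized_loss(data_loss, weights, lam):
          """Structural risk: the data loss plus lambda times the L2 complexity term."""
          return data_loss + lam * sum(w ** 2 for w in weights)

      # A higher lambda penalizes the same weights more heavily.
      print(round(regularized_loss(1.5, [0.2, 0.5, 5.0], 0.01), 4))  # 1.7529
      print(round(regularized_loss(1.5, [0.2, 0.5, 5.0], 0.5), 4))   # 14.145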

    Performing L2 regularization has the following effect on a model:

    • Encourages weight values toward 0 (but not exactly 0).
    • Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.

    Increasing the lambda value strengthens the regularization effect. For example, the histogram of weights for a high value of lambda might look as shown in Figure 2.

    Histogram of a model's weights with a mean of zero and a normal distribution.

    Figure 2. Histogram of weights.

    Lowering the value of lambda tends to yield a flatter histogram, as shown in Figure 3.

    Histogram of a model's weights with a mean of zero that is somewhere between a flat distribution and a normal distribution.

    Figure 3. Histogram of weights produced by a lower lambda value.

    When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:

    • If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.
    • If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won't be able to generalize to new data.

    The ideal value of lambda produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value of lambda is data-dependent, so you'll need to do some tuning.


    Regularization for Simplicity: Playground Exercise

    Examining L2 regularization

    This exercise contains a small, noisy training data set. In this kind of setting, overfitting is a real concern. Fortunately, regularization might help.

    This exercise consists of three related tasks. To simplify comparisons across the three tasks, run each task in a separate tab.

    (Answers appear just below the exercise.)




    Regularization for Simplicity: Check Your Understanding

    L2 Regularization

    Explore the options below.

    Imagine a linear model with 100 input features:
  • 10 are highly informative.
  • 90 are non-informative.
    Assume that all features have values between -1 and 1. Which of the following statements are true?
    L2 regularization will encourage many of the non-informative weights to be nearly (but not exactly) 0.0.
    Yes, L2 regularization encourages weights to be near 0.0, but not exactly 0.0.
    L2 regularization will encourage most of the non-informative weights to be exactly 0.0.
    L2 regularization does not tend to force weights to exactly 0.0. L2 regularization penalizes larger weights more than smaller weights. As a weight gets close to 0.0, L2 "pushes" less forcefully toward 0.0.
    L2 regularization may cause the model to learn a moderate weight for some non-informative features.
    Surprisingly, this can happen when a non-informative feature happens to be correlated with the label. In this case, the model incorrectly gives such non-informative features some of the "credit" that should have gone to informative features.

    L2 Regularization and Correlated Features

    Explore the options below.

    Imagine a linear model with two strongly correlated features; that is, these two features are nearly identical copies of one another but one feature contains a small amount of random noise. If we train this model with L2 regularization, what will happen to the weights for these two features?
    Both features will have roughly equal, moderate weights.
    L2 regularization will force the features towards roughly equivalent weights that are approximately half of what they would have been had only one of the two features been in the model.
    One feature will have a large weight; the other will have a weight of almost 0.0.
    L2 regularization penalizes large weights more than small weights. So, even if one weight started to drop faster than the other, L2 regularization would tend to force the bigger weight to drop more quickly than the smaller weight.
    One feature will have a large weight; the other will have a weight of exactly 0.0.
    L2 regularization rarely forces weights to exactly 0.0. By contrast, L1 regularization (discussed later) does force weights to exactly 0.0.

    Logistic Regression

    Instead of predicting exactly 0 or 1, logistic regression generates a probability—a value between 0 and 1, exclusive. For example, consider a logistic regression model for spam detection. If the model infers a value of 0.932 on a particular email message, it implies a 93.2% probability that the email message is spam. More precisely, it means that in the limit of infinite training examples, the set of examples for which the model predicts 0.932 will actually be spam 93.2% of the time and the remaining 6.8% will not.

    Predicting Coin Flips?

    • Imagine the problem of predicting probability of Heads for bent coins
    • You might use features like angle of bend, coin mass, etc.
    • What's the simplest model you could use?
    • What could go wrong?

    Logistic Regression

    • Many problems require a probability estimate as output
    • Enter Logistic Regression

    Logistic Regression

    • Many problems require a probability estimate as output
    • Enter Logistic Regression
    • Handy because the probability estimates are calibrated
      • for example, p(house will sell) * price = expected outcome

    Logistic Regression

    • Many problems require a probability estimate as output
    • Enter Logistic Regression
    • Handy because the probability estimates are calibrated
      • for example, p(house will sell) * price = expected outcome
    • Also useful for when we need a binary classification
      • spam or not spam? → p(Spam)

    Logistic Regression -- Predictions

    $$ y' = \frac{1}{1 + e^{-(w^Tx+b)}} $$

    \(\text{Where:}\) \(w^Tx + b\text{: provides the familiar linear model}\) \(\frac{1}{1+e^{-(...)}}\text{: squishes the linear output through a sigmoid}\)

    Graph of the logistic equation: the x-axis is the log-odds (the sum of the weighted features plus the bias); the y-axis is the probability output.

    LogLoss Defined

    $$ LogLoss = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y') $$

    Plot of Log Loss versus predicted value, for target values of 1.0 and 0.0.

    Logistic Regression and Regularization

    • Regularization is super important for logistic regression.
      • Remember the asymptotes
      • It'll keep trying to drive loss to 0 in high dimensions

    Logistic Regression and Regularization

    • Regularization is super important for logistic regression.
      • Remember the asymptotes
      • It'll keep trying to drive loss to 0 in high dimensions
    • Two strategies are especially useful:
      • L2 regularization (aka L2 weight decay) - penalizes huge weights.
      • Early stopping - limiting training steps or learning rate.

    Linear Logistic Regression

    • Linear logistic regression is extremely efficient.
      • Very fast training and prediction times.
      • Short / wide models use a lot of RAM.

    Logistic Regression: Calculating a Probability

    Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:

    • "As is"
    • Converted to a binary category

    Let's consider how we might use the probability "as is." Suppose we create a logistic regression model to predict the probability that a dog will bark during the middle of the night. We'll call that probability:

      p(bark | night)
    

    If the logistic regression model predicts a p(bark | night) of 0.05, then over a year, the dog's owners should be startled awake approximately 18 times:

      startled = p(bark | night) * nights
      18 ~= 0.05 * 365
    

    In many cases, you'll map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels (e.g., "spam" or "not spam"). A later module focuses on that.

    You might be wondering how a logistic regression model can ensure output that always falls between 0 and 1. As it happens, a sigmoid function, defined as follows, produces output having those same characteristics:

    $$y = \frac{1}{1 + e^{-z}}$$

    The sigmoid function yields the following plot:

    Sigmoid function. The x axis is the raw inference value. The y axis extends from 0 to +1, exclusive.

    Figure 1: Sigmoid function.

    If z represents the output of the linear layer of a model trained with logistic regression, then sigmoid(z) will yield a value (a probability) between 0 and 1. In mathematical terms:

    $$y' = \frac{1}{1 + e^{-(z)}}$$

    where:

    • \(y'\) is the output of the logistic regression model for a particular example.
    • \(z = b + w_1x_1 + w_2x_2 + \ldots + w_Nx_N\), where the \(w\) values are the model's learned weights, \(b\) is the bias, and the \(x\) values are the feature values for a particular example.

    Note that z is also referred to as the log-odds because the inverse of the sigmoid states that z can be defined as the log of the probability of the "1" label (e.g., "dog barks") divided by the probability of the "0" label (e.g., "dog doesn't bark"):

    $$ z = \log\left(\frac{y}{1-y}\right) $$
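
    A minimal sketch of the sigmoid and its inverse in Python:

      import math

      def sigmoid(z):
          """Map the log-odds z to a probability in (0, 1)."""
          return 1.0 / (1.0 + math.exp(-z))

      def log_odds(y):
          """Inverse of the sigmoid: recover z from a probability y."""
          return math.log(y / (1.0 - y))

      print(sigmoid(0.0))                      # 0.5
      print(round(log_odds(sigmoid(2.0)), 3))  # 2.0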

    Here is the sigmoid function with ML labels:

    The Sigmoid function with the x-axis labeled as the sum of all the weights and features (plus the bias); the y-axis is labeled Probability Output.

    Figure 2: Logistic regression output.


    Logistic Regression: Model Training

    Loss function for Logistic Regression

    The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:

    $$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

    where:

    • \((x, y) \in D\) is the data set containing many labeled examples, which are \((x, y)\) pairs.
    • \(y\) is the label in a labeled example. Since this is logistic regression, every value of \(y\) must either be 0 or 1.
    • \(y'\) is the predicted value (somewhere between 0 and 1), given the set of features in \(x\).

    The equation for Log Loss is closely related to Shannon's Entropy measure from Information Theory. It is also the negative logarithm of the likelihood function, assuming a Bernoulli distribution of \(y\). Indeed, minimizing the loss function yields a maximum likelihood estimate.
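
    Here is the Log Loss formula as a short Python sketch, summed over made-up (y, y') pairs:

      import math

      def log_loss(examples):
          """Log Loss over (y, y') pairs, where y is 0 or 1 and y' is in (0, 1)."""
          return sum(-y * math.log(yp) - (1 - y) * math.log(1 - yp)
                     for y, yp in examples)

      # Confident correct predictions cost little; confident wrong ones cost a lot.
      print(round(log_loss([(1, 0.9), (0, 0.1)]), 4))  # 0.2107
      print(round(log_loss([(1, 0.1), (0, 0.9)]), 4))  # 4.6052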

    Regularization in Logistic Regression

    Regularization is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions. Consequently, most logistic regression models use one of the following two strategies to dampen model complexity:

    • L2 regularization.
    • Early stopping, that is, limiting the number of training steps or the learning rate.

    (We'll discuss a third strategy—L1 regularization—in a later module.)

    Imagine that you assign a unique id to each example, and map each id to its own feature. If you don't specify a regularization function, the model will become completely overfit. That's because the model would try to drive loss to zero on all examples and never get there, driving the weights for each indicator feature to +infinity or -infinity. This can happen in high dimensional data with feature crosses, when there’s a huge mass of rare crosses that happen only on one example each.

    Fortunately, using L2 or early stopping will prevent this problem.

     


    Classification

    This module shows how logistic regression can be used for classification tasks, and explores how to evaluate the effectiveness of classification models.

    Classification vs. Regression

    • Sometimes, we use logistic regression for the probability outputs -- this is a regression in (0, 1)
    • Other times, we'll threshold the value for a discrete binary classification
    • Choice of threshold is an important choice, and can be tuned

    Evaluation Metrics: Accuracy

    • How do we evaluate classification models?

    Evaluation Metrics: Accuracy

    • How do we evaluate classification models?
    • One possible measure: Accuracy
      • the fraction of predictions we got right

    Accuracy Can Be Misleading

    • In many cases, accuracy is a poor or misleading metric
      • Most often when different kinds of mistakes have different costs
      • Typical case includes class imbalance, when positives or negatives are extremely rare

    True Positives and False Positives

    • For class-imbalanced problems, useful to separate out different kinds of errors
    True Positives
    We correctly called wolf!
    We saved the town.

    False Positives
    Error: we called wolf falsely.
    Everyone is mad at us.

    False Negatives
    There was a wolf, but we didn't spot it. It ate all our chickens.
    True Negatives
    No wolf, no alarm.
    Everyone is fine.

    Evaluation Metrics: Precision and Recall

    • Precision: (True Positives) / (All Positive Predictions)
      • When model said "positive" class, was it right?
      • Intuition: Did the model cry "wolf" too often?

    Evaluation Metrics: Precision and Recall

    • Precision: (True Positives) / (All Positive Predictions)
      • When model said "positive" class, was it right?
      • Intuition: Did the model cry "wolf" too often?
    • Recall: (True Positives) / (All Actual Positives)
      • Out of all the possible positives, how many did the model correctly identify?
      • Intuition: Did it miss any wolves?


    A ROC Curve

    Each point is the TP and FP rate at one decision threshold.

    ROC Curve showing TP Rate vs. FP Rate at different classification thresholds.

    Evaluation Metrics: AUC

    • AUC: "Area under the ROC Curve"

    Evaluation Metrics: AUC

    • AUC: "Area under the ROC Curve"
    • Interpretation:
      • If we pick a random positive and a random negative, what's the probability my model ranks them in the correct order?

    Evaluation Metrics: AUC

    • AUC: "Area under the ROC Curve"
    • Interpretation:
      • If we pick a random positive and a random negative, what's the probability my model ranks them in the correct order?
    • Intuition: gives an aggregate measure of performance across all possible classification thresholds

    Prediction Bias

    • Logistic Regression predictions should be unbiased.
      • average of predictions == average of observations

    Prediction Bias

    • Logistic Regression predictions should be unbiased.
      • average of predictions == average of observations
    • Bias is a canary.
      • Zero bias alone does not mean everything in your system is perfect.
      • But it's a great sanity check.

    Prediction Bias (continued)

    • If you have bias, you have a problem.
      • Incomplete feature set?
      • Buggy pipeline?
      • Biased training sample?
    • Don't fix bias with a calibration layer, fix it in the model.
    • Look for bias in slices of data -- this can guide improvements.

    Calibration Plots Show Bucketed Bias

    A calibration scatter plot of prediction versus label, showing an overprediction line, a calibration line, and an underprediction line. Each dot represents many examples in the same bucketed prediction range.

    Classification: Thresholding

    Logistic regression returns a probability. You can use the returned probability "as is" (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).

    A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam. However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.
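
    As a sketch, thresholding is a one-line mapping; the 0.5 default below is illustrative, not a recommendation:

      def classify(probability, threshold=0.5):
          """Map a logistic regression output onto a binary category."""
          return "spam" if probability >= threshold else "not spam"

      print(classify(0.9995))              # spam
      print(classify(0.6, threshold=0.8))  # not spam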

    The following sections take a closer look at metrics you can use to evaluate a classification model's predictions, as well as the impact of changing the classification threshold on these predictions.


    Classification: True vs. False and Positive vs. Negative

    In this section, we'll define the primary building blocks of the metrics we'll use to evaluate classification models. But first, a fable:

    An Aesop's Fable: The Boy Who Cried Wolf (compressed)

    A shepherd boy gets bored tending the town's flock. To have some fun, he cries out, "Wolf!" even though no wolf is in sight. The villagers run to protect the flock, but then get really mad when they realize the boy was playing a joke on them.

    [Iterate previous paragraph N times.]

    One night, the shepherd boy sees a real wolf approaching the flock and calls out, "Wolf!" The villagers refuse to be fooled again and stay in their houses. The hungry wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.

    Let's make the following definitions:

    • "Wolf" is the positive class.
    • "No wolf" is the negative class.

    We can summarize our "wolf-prediction" model using a 2x2 confusion matrix that depicts all four possible outcomes:

    True Positive (TP):
    • Reality: A wolf threatened.
    • Shepherd said: "Wolf."
    • Outcome: Shepherd is a hero.
    False Positive (FP):
    • Reality: No wolf threatened.
    • Shepherd said: "Wolf."
    • Outcome: Villagers are angry at shepherd for waking them up.
    False Negative (FN):
    • Reality: A wolf threatened.
    • Shepherd said: "No wolf."
    • Outcome: The wolf ate all the sheep.
    True Negative (TN):
    • Reality: No wolf threatened.
    • Shepherd said: "No wolf."
    • Outcome: Everyone is fine.

    A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

    A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.

    In the following sections, we'll look at how to evaluate classification models using metrics derived from these four outcomes.


    Classification: Accuracy

    Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

    $$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

    For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

    $$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$

    Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

    Let's try calculating accuracy for the following model that classified 100 tumors as malignant (the positive class) or benign (the negative class):

    True Positive (TP):
    • Reality: Malignant
    • ML model predicted: Malignant
    • Number of TP results: 1
    False Positive (FP):
    • Reality: Benign
    • ML model predicted: Malignant
    • Number of FP results: 1
    False Negative (FN):
    • Reality: Malignant
    • ML model predicted: Benign
    • Number of FN results: 8
    True Negative (TN):
    • Reality: Benign
    • ML model predicted: Benign
    • Number of TN results: 90
    $$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} = \frac{1+90}{1+90+1+8} = 0.91$$

    Accuracy comes out to 0.91, or 91% (91 correct predictions out of 100 total examples). That means our tumor classifier is doing a great job of identifying malignancies, right?

    Actually, let's do a closer analysis of positives and negatives to gain more insight into our model's performance.

    Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP) and 9 are malignant (1 TP and 8 FNs).

    Of the 91 benign tumors, the model correctly identifies 90 as benign. That's good. However, of the 9 malignant tumors, the model only correctly identifies 1 as malignant—a terrible outcome, as 8 out of 9 malignancies go undiagnosed!

    While 91% accuracy may seem good at first glance, another tumor-classifier model that always predicts benign would achieve the exact same accuracy (91/100 correct predictions) on our examples. In other words, our model is no better than one that has zero predictive ability to distinguish malignant tumors from benign tumors.

    Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one, where there is a significant disparity between the number of positive and negative labels.

    In the next section, we'll look at two better metrics for evaluating class-imbalanced problems: precision and recall.

     


    Classification: Precision and Recall

    Precision

    Precision attempts to answer the following question:

    What proportion of positive identifications was actually correct?

    Precision is defined as follows:

    $$\text{Precision} = \frac{TP}{TP+FP}$$

    Let's calculate precision for our ML model from the previous section that analyzes tumors:

    True Positives (TPs): 1 False Positives (FPs): 1
    False Negatives (FNs): 8 True Negatives (TNs): 90
    $$\text{Precision} = \frac{TP}{TP+FP} = \frac{1}{1+1} = 0.5$$

    Our model has a precision of 0.5—in other words, when it predicts a tumor is malignant, it is correct 50% of the time.

    Recall

    Recall attempts to answer the following question:

    What proportion of actual positives was identified correctly?

    Mathematically, recall is defined as follows:

    $$\text{Recall} = \frac{TP}{TP+FN}$$

    Let's calculate recall for our tumor classifier:

    True Positives (TPs): 1 False Positives (FPs): 1
    False Negatives (FNs): 8 True Negatives (TNs): 90
    $$\text{Recall} = \frac{TP}{TP+FN} = \frac{1}{1+8} = 0.11$$

    Our model has a recall of 0.11—in other words, it correctly identifies 11% of all malignant tumors.
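
    For reference, here are all three metrics for the tumor classifier in a short Python sketch:

      def accuracy(tp, tn, fp, fn):
          return (tp + tn) / (tp + tn + fp + fn)

      def precision(tp, fp):
          return tp / (tp + fp)

      def recall(tp, fn):
          return tp / (tp + fn)

      # Tumor classifier: TP = 1, FP = 1, FN = 8, TN = 90.
      print(accuracy(1, 90, 1, 8))   # 0.91
      print(precision(1, 1))         # 0.5
      print(round(recall(1, 8), 2))  # 0.11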

    Precision and Recall: A Tug of War

    To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa. Explore this notion by looking at the following figure, which shows 30 predictions made by an email classification model. Those to the right of the classification threshold are classified as "spam", while those to the left are classified as "not spam."

    A number line from 0 to 1.0 showing the output of a logistic regression model for 30 examples (actually spam and actually not spam). Examples to the right of the classification threshold are classified as "spam"; those to the left as "not spam."

    Figure 1. Classifying email messages as spam or not spam.

    Let's calculate precision and recall based on the results shown in Figure 1:

    True Positives (TP): 8 False Positives (FP): 2
    False Negatives (FN): 3 True Negatives (TN): 17

    Precision measures the percentage of emails flagged as spam that were correctly classified—that is, the percentage of dots to the right of the threshold line that are green in Figure 1:

    $$\text{Precision} = \frac{TP}{TP + FP} = \frac{8}{8+2} = 0.8$$

    Recall measures the percentage of actual spam emails that were correctly classified—that is, the percentage of green dots that are to the right of the threshold line in Figure 1:

    $$\text{Recall} = \frac{TP}{TP + FN} = \frac{8}{8 + 3} = 0.73$$

    Figure 2 illustrates the effect of increasing the classification threshold.

    Same set of examples, but with the classification threshold increased slightly. 2 of the 30 examples have been reclassified.

    Figure 2. Increasing classification threshold.

    The number of false positives decreases, but false negatives increase. As a result, precision increases, while recall decreases:

    True Positives (TP): 7 False Positives (FP): 1
    False Negatives (FN): 4 True Negatives (TN): 18
    $$\text{Precision} = \frac{TP}{TP + FP} = \frac{7}{7+1} = 0.88$$ $$\text{Recall} = \frac{TP}{TP + FN} = \frac{7}{7 + 4} = 0.64$$

    Conversely, Figure 3 illustrates the effect of decreasing the classification threshold (from its original position in Figure 1).

    Same set of examples, but with the classification threshold decreased.

    Figure 3. Decreasing classification threshold.

    False positives increase, and false negatives decrease. As a result, this time, precision decreases and recall increases:

    True Positives (TP): 9 False Positives (FP): 3
    False Negatives (FN): 2 True Negatives (TN): 16
    $$\text{Precision} = \frac{TP}{TP + FP} = \frac{9}{9+3} = 0.75$$ $$\text{Recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 2} = 0.82$$

    Various metrics have been developed that rely on both precision and recall. For example, see F1 score.

     


    Classification: Check Your Understanding (Accuracy, Precision, Recall)

    Accuracy

    Explore the options below.

    In which of the following scenarios would a high accuracy value suggest that the ML model is doing a good job?
    A deadly, but curable, medical condition afflicts .01% of the population. An ML model uses symptoms as features and predicts this affliction with an accuracy of 99.99%.
    Accuracy is a poor metric here. After all, even a "dumb" model that always predicts "not sick" would still be 99.99% accurate. Mistakenly predicting "not sick" for a person who actually is sick could be deadly.
    An expensive robotic chicken crosses a very busy road a thousand times per day. An ML model evaluates traffic patterns and predicts when this chicken can safely cross the street with an accuracy of 99.99%.
    A 99.99% accuracy value on a very busy road strongly suggests that the ML model is far better than chance. In some settings, however, the cost of making even a small number of mistakes is still too high. 99.99% accuracy means that the expensive chicken will need to be replaced, on average, every 10 days. (The chicken might also cause extensive damage to cars that it hits.)
    In the game of roulette, a ball is dropped on a spinning wheel and eventually lands in one of 38 slots. Using visual features (the spin of the ball, the position of the wheel when the ball was dropped, the height of the ball over the wheel), an ML model can predict the slot that the ball will land in with an accuracy of 4%.
    This ML model is making predictions far better than chance; a random guess would be correct 1/38 of the time—yielding an accuracy of 2.6%. Although the model's accuracy is "only" 4%, the benefits of success far outweigh the disadvantages of failure.

    Precision

    Explore the options below.

    Consider a classification model that separates email into two categories: "spam" or "not spam." If you raise the classification threshold, what will happen to precision?
    Definitely increase.
    Raising the classification threshold typically increases precision; however, precision is not guaranteed to increase monotonically as we raise the threshold.
    Probably increase.
    In general, raising the classification threshold reduces false positives, thus raising precision.
    Probably decrease.
    In general, raising the classification threshold reduces false positives, thus raising precision.
    Definitely decrease.
    In general, raising the classification threshold reduces false positives, thus raising precision.

    Recall

    Explore the options below.

    Consider a classification model that separates email into two categories: "spam" or "not spam." If you raise the classification threshold, what will happen to recall?
    Always increase.
    Raising the classification threshold will cause both of the following:
    • The number of true positives will decrease or stay the same.
    • The number of false negatives will increase or stay the same.
    Thus, recall will never increase.
    Always decrease or stay the same.
    Raising our classification threshold will cause the number of true positives to decrease or stay the same and will cause the number of false negatives to increase or stay the same. Thus, recall will either stay constant or decrease.
    Always stay constant.
    Raising our classification threshold will cause the number of true positives to decrease or stay the same and will cause the number of false negatives to increase or stay the same. Thus, recall will either stay constant or decrease.

    Precision and Recall

    Explore the options below.

    Consider two models—A and B—that each evaluate the same dataset. Which one of the following statements is true?
    If Model A has better precision than model B, then model A is better.
    While better precision is good, it might be coming at the expense of a large reduction in recall. In general, we need to look at both precision and recall together, or summary metrics like AUC which we'll talk about next.
    If model A has better recall than model B, then model A is better.
    While better recall is good, it might be coming at the expense of a large reduction in precision. In general, we need to look at both precision and recall together, or summary metrics like AUC, which we'll talk about next.
    If model A has better precision and better recall than model B, then model A is probably better.
    In general, a model that outperforms another model on both precision and recall is likely the better model. Obviously, we'll need to make sure that comparison is being done at a precision / recall point that is useful in practice for this to be meaningful. For example, suppose our spam detection model needs to have at least 90% precision to be useful and avoid unnecessary false alarms. In this case, comparing one model at {20% precision, 99% recall} to another at {15% precision, 98% recall} is not particularly instructive, as neither model meets the 90% precision requirement. But with that caveat in mind, this is a good way to think about comparing models when using precision and recall.

    Classification: ROC and AUC

    ROC curve

    An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

    • True Positive Rate
    • False Positive Rate

    True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

    $$TPR = \frac{TP} {TP + FN}$$

    False Positive Rate (FPR) is defined as follows:

    $$FPR = \frac{FP} {FP + TN}$$

    An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.

    ROC Curve showing TP Rate vs. FP Rate at different classification thresholds, with two marked points showing the TP vs. FP rate at two different decision thresholds.

    Figure 4. TP vs. FP rate at different classification thresholds.

    To compute the points in an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, there's an efficient, sorting-based algorithm that can provide this information for us, called AUC.

    AUC: Area Under the ROC Curve

    AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

    AUC (Area under the ROC Curve).

    Figure 5. AUC (Area under the ROC Curve).

    AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. For example, given the following examples, which are arranged from left to right in ascending order of logistic regression predictions:

    Positive and negative examples ranked in ascending order of logistic regression score.

    Figure 6. Predictions ranked in ascending order of logistic regression score.

    AUC represents the probability that a random positive (green) example is positioned to the right of a random negative (red) example.

    AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
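
    That ranking interpretation translates directly into a brute-force sketch, fine for small example sets (real implementations use the sorting-based algorithm mentioned earlier):

      from itertools import product

      def auc(scores, labels):
          """Probability that a random positive outscores a random negative
          (ties count as half)."""
          pos = [s for s, y in zip(scores, labels) if y == 1]
          neg = [s for s, y in zip(scores, labels) if y == 0]
          wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
                     for p, n in product(pos, neg))
          return wins / (len(pos) * len(neg))

      print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
      # Doubling every score preserves the ranking, so AUC is unchanged.
      print(auc([0.2, 0.8, 0.7, 1.6], [0, 0, 1, 1]))   # 0.75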

AUC is desirable for the following two reasons:

• AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
• AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.

However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:

• Scale invariance is not always desirable. For example, sometimes we really do need well-calibrated probability outputs, and AUC won't tell us about that.
• Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. AUC isn't a useful metric for this type of optimization.


    Classification: Check Your Understanding (ROC and AUC)

    ROC and AUC

    Explore the options below.

    Which of the following ROC curves produce AUC values greater than 0.5?
An ROC curve with one vertical line running from (0,0) to (0,1), and a horizontal line from (0,1) to (1,1). The TP rate is 1.0 for all FP rates.

    This is the best possible ROC curve, as it ranks all positives above all negatives. It has an AUC of 1.0.

    In practice, if you have a "perfect" classifier with an AUC of 1.0, you should be suspicious, as it likely indicates a bug in your model. For example, you may have overfit to your training data, or the label data may be replicated in one of your features.

An ROC curve with one horizontal line running from (0,0) to (1,0), and a vertical line from (1,0) to (1,1). The TP rate is 0.0 for all FP rates below 1.0.
This is the worst possible ROC curve; it ranks all negatives above all positives, and has an AUC of 0.0. If you were to reverse every prediction (flip negatives to positives and positives to negatives), you'd actually have a perfect classifier!
An ROC curve that is a diagonal line running from (0,0) to (1,1).
    This ROC curve has an AUC of 0.5, meaning it ranks a random positive example higher than a random negative example 50% of the time. As such, the corresponding classification model is basically worthless, as its predictive ability is no better than random guessing.
An ROC curve that arcs above the diagonal running from (0,0) to (1,1).
    This ROC curve has an AUC between 0.5 and 1.0, meaning it ranks a random positive example higher than a random negative example more than 50% of the time. Real-world binary classification AUC values generally fall into this range.
An ROC curve that sags below the diagonal running from (0,0) to (1,1).
    This ROC curve has an AUC between 0 and 0.5, meaning it ranks a random positive example higher than a random negative example less than 50% of the time. The corresponding model actually performs worse than random guessing! If you see an ROC curve like this, it likely indicates there's a bug in your data.

    AUC and Scaling Predictions

    Explore the options below.

    How would multiplying all of the predictions from a given model by 2.0 (for example, if the model predicts 0.4, we multiply by 2.0 to get a prediction of 0.8) change the model's performance as measured by AUC?
    No change. AUC only cares about relative prediction scores.
    Yes, AUC is based on the relative predictions, so any transformation of the predictions that preserves the relative ranking has no effect on AUC. This is clearly not the case for other metrics such as squared error, log loss, or prediction bias (discussed later).
    It would make AUC terrible, since the prediction values are now way off.
    Interestingly enough, even though the prediction values are different (and likely farther from the truth), multiplying them all by 2.0 would keep the relative ordering of prediction values the same. Since AUC only cares about relative rankings, it is not impacted by any simple scaling of the predictions.
    It would make AUC better, because the prediction values are all farther apart.
The amount of spread between predictions does not actually impact AUC. Even if the prediction score for a randomly drawn true positive is only a tiny epsilon greater than that of a randomly drawn negative, AUC counts that as a success contributing to the overall AUC score.

    Classification: Prediction Bias

    Logistic regression predictions should be unbiased. That is:

    "average of predictions" should ≈ "average of observations"

    Prediction bias is a quantity that measures how far apart those two averages are. That is:

    $$\text{prediction bias} = \text{average of predictions} - \text{average of labels in data set}$$

    A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.

    For example, let's say we know that on average, 1% of all emails are spam. If we don't know anything at all about a given email, we should predict that it's 1% likely to be spam. Similarly, a good spam model should predict on average that emails are 1% likely to be spam. (In other words, if we average the predicted likelihoods of each individual email being spam, the result should be 1%.) If instead, the model's average prediction is 20% likelihood of being spam, we can conclude that it exhibits prediction bias.
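In code, prediction bias is just a difference of two means. Here's a minimal sketch with hypothetical predictions and labels:

  import numpy as np

  predictions = np.array([0.01, 0.03, 0.05, 0.02, 0.09])  # predicted P(spam)
  labels      = np.array([0,    0,    1,    0,    0])     # observed outcomes

  prediction_bias = predictions.mean() - labels.mean()
  print(prediction_bias)   # 0.04 - 0.2 = -0.16: the model under-predicts spam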

Possible root causes of prediction bias are:

• Incomplete feature set
• Noisy data set
• Buggy pipeline
• Biased training sample
• Overly strong regularization

You might be tempted to correct prediction bias by post-processing the learned model—that is, by adding a calibration layer that adjusts your model's output to reduce the prediction bias. For example, if your model has +3% bias, you could add a calibration layer that lowers the mean prediction by 3%. However, adding a calibration layer is a bad idea for the following reasons:

• You're fixing the symptom rather than the cause.
• You've built a more brittle system that you must now keep up to date.

    If possible, avoid calibration layers. Projects that use calibration layers tend to become reliant on them—using calibration layers to fix all their model's sins. Ultimately, maintaining the calibration layers can become a nightmare.

    Bucketing and Prediction Bias

    Logistic regression predicts a value between 0 and 1. However, all labeled examples are either exactly 0 (meaning, for example, "not spam") or exactly 1 (meaning, for example, "spam"). Therefore, when examining prediction bias, you cannot accurately determine the prediction bias based on only one example; you must examine the prediction bias on a "bucket" of examples. That is, prediction bias for logistic regression only makes sense when grouping enough examples together to be able to compare a predicted value (for example, 0.392) to observed values (for example, 0.394).

You can form buckets in the following ways:

• Linearly breaking up the target predictions.
• Forming quantiles.
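Here's a minimal NumPy sketch of the quantile approach on synthetic data that is calibrated by construction, so each bucket's mean prediction should roughly match its mean label:

  import numpy as np

  rng = np.random.default_rng(0)
  predictions = rng.uniform(0, 1, size=10_000)   # hypothetical model outputs
  # Synthetic labels whose true probability equals the prediction.
  labels = (rng.uniform(0, 1, size=10_000) < predictions).astype(float)

  order = np.argsort(predictions)
  for bucket in np.array_split(order, 10):       # 10 quantile buckets
      print(f"mean prediction={predictions[bucket].mean():.3f}  "
            f"mean label={labels[bucket].mean():.3f}")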

Consider the following calibration plot from a particular model. Each dot represents a bucket of 1,000 values. The axes have the following meanings:

• The x-axis represents the average of values the model predicted for that bucket.
• The y-axis represents the actual average of values in the data set for that bucket.

    Both axes are logarithmic scales.

X-axis is Prediction; y-axis is Label. For middle and high values of prediction, the prediction bias is negligible. For low values of prediction, the prediction bias is relatively high.

    Figure 8. Prediction bias curve (logarithmic scales)

Why are the predictions so poor for only part of the model? Here are a few possibilities:

• The training set doesn't adequately represent certain subsets of the data space.
• Some subsets of the data set are noisier than others.
• The model is overly regularized. (Consider reducing the value of lambda.)


    Classification: Programming Exercise

    In the following exercise, you'll explore logistic regression and classification in TensorFlow:

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Logistic Regression programming exercise

    Regularization for Sparsity

    This module focuses on the special requirements for models learned on feature vectors that have many dimensions.

    Let's Go Back to Feature Crosses

    • Caveat: Sparse feature crosses may significantly increase feature space
    • Possible issues:
      • Model size (RAM) may become huge
      • "Noise" coefficients (causes overfitting)

    L1 Regularization

    • Would like to penalize L0 norm of weights
      • Non-convex optimization; NP-hard

    L1 Regularization

• Would like to penalize L0 norm of weights
  • Non-convex optimization; NP-hard
• Relax to L1 regularization:
  • Penalize sum of abs(weights)
  • Convex problem
  • Encourages sparsity, unlike L2

    Regularization for Sparsity: L₁ Regularization

    Sparse vectors often contain many dimensions. Creating a feature cross results in even more dimensions. Given such high-dimensional feature vectors, model size may become huge and require huge amounts of RAM.

    In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.

    For example, consider a housing data set that covers not just California but the entire globe. Bucketing global latitude at the minute level (60 minutes per degree) gives about 10,000 dimensions in a sparse encoding; global longitude at the minute level gives about 20,000 dimensions. A feature cross of these two features would result in roughly 200,000,000 dimensions. Many of those 200,000,000 dimensions represent areas of such limited residence (for example, the middle of the ocean) that it would be difficult to use that data to generalize effectively. It would be silly to pay the RAM cost of storing these unneeded dimensions. Therefore, it would be nice to encourage the weights for the meaningless dimensions to drop to exactly 0, which would allow us to avoid paying for the storage cost of these model coefficients at inference time.

    We might be able to encode this idea into the optimization problem done at training time, by adding an appropriately chosen regularization term.

    Would L2 regularization accomplish this task? Unfortunately not. L2 regularization encourages weights to be small, but doesn't force them to exactly 0.0.

An alternative idea would be to try to create a regularization term that penalizes the count of non-zero coefficient values in a model. Increasing this count would only be justified if there was a sufficient gain in the model's ability to fit the data. Unfortunately, while this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem that's NP-hard. (If you squint, you can see a connection to the knapsack problem.) So this idea, known as L0 regularization, isn't something we can use effectively in practice.

    However, there is a regularization term called L1 regularization that serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. So we can use L1 regularization to encourage many of the uninformative coefficients in our model to be exactly 0, and thus reap RAM savings at inference time.

L1 vs. L2 Regularization

L2 and L1 penalize weights differently:

• L2 penalizes weight².
• L1 penalizes |weight|.

Consequently, L2 and L1 have different derivatives:

• The derivative of L2 is 2 * weight.
• The derivative of L1 is k (a constant, whose value is independent of weight).

    You can think of the derivative of L2 as a force that removes x% of the weight every time. As Zeno knew, even if you remove x percent of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.

    You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For example, if subtraction would have forced a weight from +0.1 to -0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight.
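Here's a minimal sketch of those two forces acting on a single weight, with made-up learning-rate and lambda values: the L2 update removes a fraction of the weight each step, while the L1 update subtracts a constant and clamps at zero (soft thresholding):

  import numpy as np

  w_l1, w_l2 = 0.5, 0.5
  lr, lam = 0.1, 0.3

  for step in range(20):
      # L2: gradient of lam * w^2 is 2 * lam * w -> removes a fraction each step.
      w_l2 -= lr * 2 * lam * w_l2
      # L1: gradient of lam * |w| is the constant lam * sign(w);
      # if the update would cross zero, clamp the weight to exactly 0.
      shrunk = abs(w_l1) - lr * lam
      w_l1 = np.sign(w_l1) * max(shrunk, 0.0)

  print(w_l1, w_l2)   # w_l1 reaches exactly 0.0; w_l2 is small but nonzero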

    L1 regularization—penalizing the absolute value of all the weights—turns out to be quite efficient for wide models.

    Note that this description is true for a one-dimensional model.

Click the Play button below to compare the effect L1 and L2 regularization have on a network of weights.


    Regularization for Sparsity: Playground Exercise

    Examining L1 Regularization

    This exercise contains a small, slightly noisy, training data set. In this kind of setting, overfitting is a real concern. Regularization might help, but which form of regularization?

    This exercise consists of five related tasks. To simplify comparisons across the five tasks, run each task in a separate tab. Notice that the thicknesses of the lines connecting FEATURES and OUTPUT represent the relative weights of each feature.

    Task Regularization Type Regularization Rate (lambda)
    1 L2 0.1
    2 L2 0.3
    3 L1 0.1
    4 L1 0.3
    5 L1 experiment

    Questions:

    1. How does switching from L2 to L1 regularization influence the delta between test loss and training loss?
    2. How does switching from L2 to L1 regularization influence the learned weights?
    3. How does increasing the L1 regularization rate (lambda) influence the learned weights?

    (Answers appear just below the exercise.)



     


    Regularization for Sparsity: Programming Exercise

    In the following exercise, you'll explore L1 regularization in TensorFlow:

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Sparsity and L1 Regularization programming exercise

    Regularization for Sparsity: Check Your Understanding

    L1 regularization

    Explore the options below.

    Imagine a linear model with 100 input features:
  • 10 are highly informative.
  • 90 are non-informative.
• All features have values between -1 and 1.

Which of the following statements are true?
    L1 regularization will encourage many of the non-informative weights to be nearly (but not exactly) 0.0.
In general, L1 regularization of sufficient lambda tends to drive the weights of non-informative features to exactly 0.0, not just near it. Unlike L2 regularization, L1 regularization "pushes" just as hard toward 0.0 no matter how far the weight is from 0.0.
    L1 regularization will encourage most of the non-informative weights to be exactly 0.0.
    L1 regularization of sufficient lambda tends to encourage non-informative weights to become exactly 0.0. By doing so, these non-informative features leave the model.
    L1 regularization may cause informative features to get a weight of exactly 0.0.
Be careful: L1 regularization may cause the following kinds of features to be given weights of exactly 0:
  • Weakly informative features.
  • Strongly informative features on different scales.
  • Informative features strongly correlated with other similarly informative features.
L1 vs. L2 Regularization

    Explore the options below.

    Imagine a linear model with 100 input features, all having values between -1 and 1:
  • 10 are highly informative.
  • 90 are non-informative.
Which type of regularization will produce the smaller model?
    L2 regularization.
    L2 regularization rarely reduces the number of features. In other words, L2 regularization rarely reduces the model size.
    L1 regularization.
    L1 regularization tends to reduce the number of features. In other words, L1 regularization often reduces the model size.

    Introduction to Neural Networks

    Neural networks are a more sophisticated version of feature crosses. In essence, neural networks learn the appropriate feature crosses for you.

    A Linear Model

Three blue circles in a row labeled "Input" connected by arrows to a green circle labeled "Output" above them.

    Add Complexity: Non-Linear?

Three blue circles in a row labeled "Input" connected by arrows to a row of yellow circles labeled "Hidden Layer" above them, which are in turn connected to a green circle labeled "Output" at the top.

    More Complex: Non-Linear?

A graph of a model with an input layer, two hidden layers labeled "Hidden Layer 1" and "Hidden Layer 2", and an output layer.

    Adding a Non-Linearity

The same as the previous figure, except that a row of pink circles labeled "Non-Linear Transformation Layer" (a.k.a. activation function) has been added in between the two hidden layers. (We usually don't draw non-linear transforms.)

    Our Favorite Non-Linearity

A graph with slope of 0 for negative x that becomes linear once it passes x=0: the Rectified Linear Unit (ReLU), F(x)=max(0,x).

    Neural Nets Can Be Arbitrarily Complex

A complex neural network with an input layer, two hidden layers, and an output layer.

    Introduction to Neural Networks: Anatomy

    If you recall from the Feature Crosses unit, the following classification problem is nonlinear:

    Cartesian plot. Traditional x axis is labeled 'x1'. Traditional y axis is labeled 'x2'. Blue dots occupy the northwest and southeast quadrants; yellow dots occupy the southwest and northeast quadrants.

    Figure 1. Nonlinear classification problem.

    "Nonlinear" means that you can't accurately predict a label with a model of the form $$b + w_1x_1 + w_2x_2$$ In other words, the "decision surface" is not a line. Previously, we looked at feature crosses as one possible approach to modeling nonlinear problems.

    Now consider the following data set:

    Data set contains many orange and many blue dots. It is hard to determine a coherent pattern, but the orange dots vaguely form a spiral and the blue dots perhaps form a different spiral.

    Figure 2. A more difficult nonlinear classification problem.

    The data set shown in Figure 2 can't be solved with a linear model.

    To see how neural networks might help with nonlinear problems, let's start by representing a linear model as a graph:

Three blue circles in a row labeled "Input" connected by arrows to a green circle labeled "Output" above them.

    Figure 3. Linear model as graph.

    Each blue circle represents an input feature, and the green circle represents the weighted sum of the inputs.

    How can we alter this model to improve its ability to deal with nonlinear problems?

    Hidden Layers

    In the model represented by the following graph, we've added a "hidden layer" of intermediary values. Each yellow node in the hidden layer is a weighted sum of the blue input node values. The output is a weighted sum of the yellow nodes.

Three blue circles in a row labeled "Input" connected by arrows to a row of yellow circles labeled "Hidden Layer" above them, which are in turn connected to a green circle labeled "Output" at the top.

    Figure 4. Graph of two-layer model.

    Is this model linear? Yes—its output is still a linear combination of its inputs.

    In the model represented by the following graph, we've added a second hidden layer of weighted sums.

A graph of a model with an input layer, two hidden layers labeled "Hidden Layer 1" and "Hidden Layer 2", and an output layer.

    Figure 5. Graph of three-layer model.

    Is this model still linear? Yes, it is. When you express the output as a function of the input and simplify, you get just another weighted sum of the inputs. This sum won't effectively model the nonlinear problem in Figure 2.
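You can check this collapse numerically. In the following sketch (random, made-up weights; biases omitted for brevity), two stacked linear layers produce exactly the same outputs as the single weighted sum obtained by multiplying their weight matrices:

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=(5, 3))        # 5 examples, 3 input features
  W1 = rng.normal(size=(3, 4))       # hidden-layer weights
  W2 = rng.normal(size=(4, 1))       # output weights

  two_layer = (x @ W1) @ W2          # "deep" model without activation functions
  one_layer = x @ (W1 @ W2)          # equivalent single weighted sum

  print(np.allclose(two_layer, one_layer))   # True: still a linear model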

    Activation Functions

    To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe each hidden layer node through a nonlinear function.

    In the model represented by the following graph, the value of each node in Hidden Layer 1 is transformed by a nonlinear function before being passed on to the weighted sums of the next layer. This nonlinear function is called the activation function.

The same as the previous figure, except that a row of pink circles labeled "Non-Linear Transformation Layer" (a.k.a. activation function) has been added in between the two hidden layers. (We usually don't draw non-linear transforms.)

    Figure 6. Graph of three-layer model with activation function.

    Now that we've added an activation function, adding layers has more impact. Stacking nonlinearities on nonlinearities lets us model very complicated relationships between the inputs and the predicted outputs. In brief, each layer is effectively learning a more complex, higher-level function over the raw inputs. If you'd like to develop more intuition on how this works, see Chris Olah's excellent blog post.

    Common Activation Functions

    The following sigmoid activation function converts the weighted sum to a value between 0 and 1.

    $$F(x)=\frac{1} {1+e^{-x}}$$

    Here's a plot:

A plot of the sigmoid function.

    Figure 7. Sigmoid activation function.

    The following rectified linear unit activation function (or ReLU, for short) often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute.

    $$F(x)=max(0,x)$$

    The superiority of ReLU is based on empirical findings, probably driven by ReLU having a more useful range of responsiveness. A sigmoid's responsiveness falls off relatively quickly on both sides.

A plot of the ReLU activation function.

    Figure 8. ReLU activation function.

In fact, any mathematical function can serve as an activation function. Suppose that \(\sigma\) represents our activation function (ReLU, sigmoid, or whatever). Consequently, the value of a node in the network is given by the following formula:

    $$\sigma(\boldsymbol w \cdot \boldsymbol x+b)$$

    TensorFlow provides out-of-the-box support for a wide variety of activation functions. That said, we still recommend starting with ReLU.
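Putting that formula into code, here's a minimal sketch of a single node's value under each activation; the weights, inputs, and bias are made-up values:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def relu(x):
      return np.maximum(0.0, x)

  w = np.array([0.2, -0.5, 0.8])   # hypothetical weights
  x = np.array([1.0,  2.0, 0.5])   # hypothetical inputs
  b = 0.1                          # hypothetical bias

  z = np.dot(w, x) + b             # weighted sum into the node
  print(sigmoid(z), relu(z))       # node value under each activation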

    Summary

Now our model has all the standard components of what people usually mean when they say "neural network":

• A set of nodes, analogous to neurons, organized in layers.
• A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
• A set of biases, one for each node.
• An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.

    A caveat: neural networks aren't necessarily always better than feature crosses, but neural networks do offer a flexible alternative that works well in many cases.


    Introduction to Neural Networks: Playground Exercises

    A First Neural Network

    In this exercise, we will train our first little neural net. Neural nets will give us a way to learn nonlinear models without the use of explicit feature crosses.

    Task 1: The model as given combines our two input features into a single neuron. Will this model learn any nonlinearities? Run it to confirm your guess.

    Task 2: Try increasing the number of neurons in the hidden layer from 1 to 2, and also try changing from a Linear activation to a nonlinear activation like ReLU. Can you create a model that can learn nonlinearities?

    Task 3: Continue experimenting by adding or removing hidden layers and neurons per layer. Also feel free to change learning rates, regularization, and other learning settings. What is the smallest number of nodes and layers you can use that gives test loss of 0.177 or lower?

    (Answers appear just below the exercise.)



    Neural Net Initialization

    This exercise uses the XOR data again, but looks at the repeatability of training Neural Nets and the importance of initialization.

    Task 1: Run the model as given four or five times. Before each trial, hit the Reset the network button to get a new random initialization. (The Reset the network button is the circular reset arrow just to the left of the Play button.) Let each trial run for at least 500 steps to ensure convergence. What shape does each model output converge to? What does this say about the role of initialization in non-convex optimization?

    Task 2: Try making the model slightly more complex by adding a layer and a couple of extra nodes. Repeat the trials from Task 1. Does this add any additional stability to the results?

    (Answers appear just below the exercise.)



    Neural Net Spiral

    This data set is a noisy spiral. Obviously, a linear model will fail here, but even manually defined feature crosses may be hard to construct.

    Task 1: Train the best model you can, using just X1 and X2. Feel free to add or remove layers and neurons, change learning settings like learning rate, regularization rate, and batch size. What is the best test loss you can get? How smooth is the model output surface?

    Task 2: Even with Neural Nets, some amount of feature engineering is often needed to achieve best performance. Try adding in additional cross product features or other transformations like sin(X1) and sin(X2). Do you get a better model? Is the model output surface any smoother?

    (Answers appear just below the exercise.)




    Introduction to Neural Networks: Programming Exercise

    The following exercise demonstrates how to use neural nets to learn nonlinearities:

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Intro to Neural Nets programming exercise

    Training Neural Networks

Backpropagation is the most common training algorithm for neural networks. It makes gradient descent feasible for multi-layer neural networks. TensorFlow handles backpropagation automatically, so you don't need a deep understanding of the algorithm. To get a sense of how it works, walk through the following: Backpropagation algorithm visual explanation. As you scroll through the preceding explanation, note the following:

• How data flows through the graph.
• How dynamic programming lets us avoid computing exponentially many paths through the graph. Here "dynamic programming" just means recording intermediate results on the forward and backward passes.

    Backprop: What You Need To Know

    • Gradients are important
      • If it's differentiable, we can probably learn on it

    Backprop: What You Need To Know

    • Gradients are important
      • If it's differentiable, we can probably learn on it
    • Gradients can vanish
      • Each additional layer can successively reduce signal vs. noise
  • ReLUs are useful here

    Backprop: What You Need To Know

    • Gradients are important
      • If it's differentiable, we can probably learn on it
    • Gradients can vanish
      • Each additional layer can successively reduce signal vs. noise
  • ReLUs are useful here
    • Gradients can explode
      • Learning rates are important here
      • Batch normalization (useful knob) can help

    Backprop: What You Need To Know

    • Gradients are important
      • If it's differentiable, we can probably learn on it
    • Gradients can vanish
      • Each additional layer can successively reduce signal vs. noise
  • ReLUs are useful here
    • Gradients can explode
      • Learning rates are important here
      • Batch normalization (useful knob) can help
• ReLU layers can die
      • Keep calm and lower your learning rates

    Normalizing Feature Values

    • We'd like our features to have reasonable scales
      • Roughly zero-centered, [-1, 1] range often works well
      • Helps gradient descent converge; avoid NaN trap
      • Avoiding outlier values can also help
    • Can use a few standard methods:
      • Linear scaling
      • Hard cap (clipping) to max, min
      • Log scaling
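Here's a minimal NumPy sketch of the three methods listed above, applied to a hypothetical long-tailed feature column:

  import numpy as np

  values = np.array([1.0, 4.0, 10.0, 250.0, 9000.0])   # hypothetical raw feature

  # Linear scaling to the [-1, 1] range.
  lo, hi = values.min(), values.max()
  linear = 2 * (values - lo) / (hi - lo) - 1

  # Hard cap (clipping) to a chosen [min, max] range.
  clipped = np.clip(values, 0.0, 100.0)

  # Log scaling, useful when a few values are much larger than the rest.
  logged = np.log1p(values)

  print(linear, clipped, logged, sep="\n")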

    Dropout Regularization

    • Dropout: Another form of regularization, useful for NNs
    • Works by randomly "dropping out" units in a network for a single gradient step
      • There's a connection to ensemble models here
    • The more you drop out, the stronger the regularization
      • 0.0 = no dropout regularization
      • 1.0 = drop everything out! learns nothing
      • Intermediate values more useful

    Training Neural Networks: Best Practices

    This section explains backpropagation's failure cases and the most common way to regularize a neural network.

    Failure Cases

    There are a number of common ways for backpropagation to go wrong.

    Vanishing Gradients

    The gradients for the lower layers (closer to the input) can become very small. In deep networks, computing these gradients can involve taking the product of many small terms.

    When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all.

    The ReLU activation function can help prevent vanishing gradients.

    Exploding Gradients

    If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large to converge.

    Batch normalization can help prevent exploding gradients, as can lowering the learning rate.

    Dead ReLU Units

    Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0.

    Lowering the learning rate can help keep ReLU units from dying.

    Dropout Regularization

Yet another form of regularization, called Dropout, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization:

• 0.0 = No dropout regularization.
• 1.0 = Drop out everything. The model learns nothing.
• Values between 0.0 and 1.0 = More useful.
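Here's a minimal NumPy sketch of what one dropout step does to a layer's activations, assuming a hypothetical dropout rate of 0.5 and the common "inverted dropout" rescaling that keeps expected activations unchanged:

  import numpy as np

  rng = np.random.default_rng(0)
  activations = rng.uniform(size=10)   # hypothetical hidden-layer outputs
  dropout_rate = 0.5                   # 0.0 = no dropout, 1.0 = drop everything

  # For one gradient step, zero out each unit independently with
  # probability dropout_rate, scaling survivors to keep the expected value.
  mask = rng.uniform(size=10) >= dropout_rate
  dropped = activations * mask / (1.0 - dropout_rate)
  print(dropped)

At inference time, no units are dropped.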


    Training Neural Networks: Programming Exercise

    The following exercise focuses on improving the performance of the neural net you trained in the previous exercise:

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Improving Neural Net Performance programming exercise

    Multi-Class Neural Networks

Earlier, you encountered binary classification models that could pick between one of two possible choices, such as whether:

• A given email is spam or not spam.
• A given tumor is malignant or benign.

In this module, we'll investigate multi-class classification, which can pick from multiple possibilities. For example:

• Is this dog a beagle, a basset hound, or a bloodhound?
• Is this flower a Siberian Iris, Dutch Iris, Blue Flag Iris, or Dwarf Bearded Iris?

    Some real-world multi-class problems entail choosing from millions of separate classes. For example, consider a multi-class classification model that can identify the image of just about anything.

    More than two classes?

    • Logistic regression gives useful probabilities for binary-class problems.
      • spam / not-spam
      • click / not-click
    • What about multi-class problems?
      • apple, banana, car, cardiologist, ..., walk sign, zebra, zoo
      • red, orange, yellow, green, blue, indigo, violet
      • animal, vegetable, mineral

    One-Vs-All Multi-Class

    • Create a unique output for each possible class
    • Train that on a signal of "my class" vs "all other classes"
    • Can do in a deep network, or with separate models
A neural network with an input layer, two hidden layers, a logits layer, and a one-vs.-all (sigmoid) output layer with five yes/no nodes: apple, bear, candy, dog, egg.

    SoftMax Multi-Class

    • Add an additional constraint: Require output of all one-vs-all nodes to sum to 1.0
    • The additional constraint helps training converge quickly
    • Plus, allows outputs to be interpreted as probabilities
A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.

What to Use When?

    • Multi-Class, Single-Label Classification:
      • An example may be a member of only one class.
      • Constraint that classes are mutually exclusive is helpful structure.
      • Useful to encode this in the loss.
      • Use one softmax loss for all possible classes.
    • Multi-Class, Multi-Label Classification:
      • An example may be a member of more than one class.
      • No additional constraints on class membership to exploit.
      • One logistic regression loss for each possible class.

    SoftMax Options

    • Full SoftMax
      • Brute force; calculates for all classes.

    SoftMax Options

    • Full SoftMax
      • Brute force; calculates for all classes.
    • Candidate Sampling
      • Calculates for all the positive labels, but only for a random sample of negatives.

    Multi-Class Neural Networks: One vs. All

    One vs. all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question. For example, given a picture of a dog, five different recognizers might be trained, four seeing the image as a negative example (not a dog) and one seeing the image as a positive example (a dog). That is:

    1. Is this image an apple? No.
    2. Is this image a bear? No.
    3. Is this image candy? No.
    4. Is this image a dog? Yes.
    5. Is this image an egg? No.

    This approach is fairly reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises.

    We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following figure suggests this approach:

A neural network with an input layer, two hidden layers, a logits layer, and a one-vs.-all (sigmoid) output layer with five yes/no nodes: apple, bear, candy, dog, egg.

    Figure 1. A one-vs.-all neural network.


    Multi-Class Neural Networks: Softmax

    Recall that logistic regression produces a decimal between 0 and 1.0. For example, a logistic regression output of 0.8 from an email classifier suggests an 80% chance of an email being spam and a 20% chance of it being not spam. Clearly, the sum of the probabilities of an email being either spam or not spam is 1.0.

    Softmax extends this idea into a multi-class world. That is, Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.

    For example, returning to the image analysis we saw in Figure 1, Softmax might produce the following likelihoods of an image belonging to a particular class:

    Class Probability
    apple 0.001
    bear 0.04
    candy 0.008
    dog 0.95
    egg 0.001
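Here's a minimal NumPy sketch of the Softmax computation itself; the logits are made-up values chosen to roughly reproduce the table above, and the max is subtracted before exponentiating for numerical stability:

  import numpy as np

  def softmax(logits):
      exps = np.exp(logits - np.max(logits))   # subtract max for stability
      return exps / exps.sum()

  logits = np.array([-3.0, 0.7, -1.0, 3.8, -3.0])   # hypothetical class scores
  probs = softmax(logits)
  print(probs, probs.sum())   # per-class probabilities; they sum to 1.0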

    Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.

A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.

    Figure 2. A Softmax layer within a neural network.

    Softmax Options

Consider the following variants of Softmax:

• Full Softmax is the Softmax we've been discussing; that is, Softmax calculates a probability for every possible class.
• Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.

    Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.

    One Label vs. Many Labels

Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For such examples:

• You may not use Softmax.
• You must rely on multiple logistic regressions.

    For example, suppose your examples are images containing exactly one item—a piece of fruit. Softmax can determine the likelihood of that one item being a pear, an orange, an apple, and so on. If your examples are images containing all sorts of things—bowls of different kinds of fruit—then you'll have to use multiple logistic regressions instead.


    Multi-Class Neural Networks: Programming Exercise

    In the following exercise, you'll explore Softmax in TensorFlow by developing a model that will classify handwritten digits:

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • MNIST Digit Classification programming exercise

    Embeddings

    An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

    Motivation From Collaborative Filtering

    • Input: 1,000,000 movies that 500,000 users have chosen to watch
    • Task: Recommend movies to users

To solve this problem, some method is needed to determine which movies are similar to each other.

    Organizing Movies by Similarity (1d)

A list of movies ordered in a single line from left to right. Starting from the left: 'Shrek', 'The Incredibles', 'The Triplets of Belleville', 'Harry Potter', 'Star Wars', 'Bleu', 'The Dark Knight Rises', and 'Memento'.

    Organizing Movies by Similarity (2d)

The same list of movies as in the previous slide, but arranged across two dimensions; for example, 'Shrek' is above and to the left of 'The Incredibles'.

    Two-Dimensional Embedding

Similar to the previous diagram, but with axes and labels for each quadrant. The arrangement of the movies is the following: the first, upper-right quadrant is Adult Blockbusters, containing 'Star Wars' and 'The Dark Knight Rises', with 'Hero' and 'Crouching Tiger, Hidden Dragon' added. The second, lower-right quadrant is Adult Arthouse, containing 'Bleu' and 'Memento', with 'Waking Life' added. The third, lower-left quadrant is Children Arthouse, containing 'The Triplets of Belleville', with 'Wallace and Gromit' added. The fourth, upper-left quadrant is Children Blockbusters, containing 'Shrek', 'The Incredibles', and 'Harry Potter', with 'School of Rock' added.

    Two-Dimensional Embedding

    The same arrangement as the last slide. 'Shrek' and 'Bleu' are highlighted as examples of their coordinates in the 2d embedding plane.

    d-Dimensional Embeddings

    • Assumes user interest in movies can be roughly explained by d aspects
• Each movie becomes a d-dimensional point, where the value in each dimension represents how much the movie fits that aspect
    • Embeddings can be learned from data

    Learning Embeddings in a Deep Network

• No separate training process needed: the embedding layer is just a hidden layer with one unit per dimension
• Supervised information (e.g. users watched the same two movies) tailors the learned embeddings for the desired task
• Intuitively, the hidden units discover how to organize the items in the d-dimensional space so as to best optimize the final objective
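To make the "just a hidden layer" point concrete, here's a minimal NumPy sketch (all values hypothetical) showing that multiplying a one-hot movie vector by the embedding weight matrix is the same as looking up that movie's row:

  import numpy as np

  rng = np.random.default_rng(0)
  num_movies, d = 6, 3
  embedding_matrix = rng.normal(size=(num_movies, d))   # learned during training

  movie_id = 4
  one_hot = np.zeros(num_movies)
  one_hot[movie_id] = 1.0

  # A one-hot input times the weight matrix selects one row...
  via_matmul = one_hot @ embedding_matrix
  # ...so in practice we just index the row directly.
  via_lookup = embedding_matrix[movie_id]
  print(np.allclose(via_matmul, via_lookup))   # True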

    Input Representation

    • Each example (a row in this matrix) is a sparse vector of features (movies) that have been watched by the user
• A dense representation of this example, (0, 1, 0, 1, 0, 0, 0, 1), is not efficient in terms of space and time.

    A table where each column header is a movie and each row represents a user and the movies they have watched.

    Input Representation

    • Build a dictionary mapping each feature to an integer from 0, ..., # movies - 1
• Efficiently represent the sparse vector as just the movies the user watched. Based on the column positions of the movies in the sparse vector displayed below, 'The Triplets of Belleville', 'Wallace and Gromit', and 'Memento' can be represented as (0, 1, 999999).
A sparse vector represented as a table, with each column representing a movie and each row representing a user. The table contains the movies from the previous diagrams and is numbered from 1 to 999999. Each cell of the table is checked if the user has watched that movie.
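Here's a minimal Python sketch of this dictionary-based encoding, using a tiny hypothetical four-movie vocabulary rather than the full catalog:

  # Hypothetical vocabulary: each movie gets an integer ID.
  vocab = {
      "The Triplets of Belleville": 0,
      "Wallace and Gromit": 1,
      "Shrek": 2,
      "Memento": 3,
  }

  # Represent a user by the IDs of the movies they watched, not by a
  # length-|vocab| dense vector of mostly zeros.
  watched = ["The Triplets of Belleville", "Wallace and Gromit", "Memento"]
  sparse = sorted(vocab[title] for title in watched)
  print(sparse)   # [0, 1, 3]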

    An Embedding Layer in a Deep Network

    Regression problem to predict home sales prices:

A diagram of a deep neural network used to predict home sale prices. A sparse vector encoding of the words in the real estate ad feeds into a three-dimensional embedding layer; the embedding, along with latitude and longitude input features, feeds into multiple hidden layers, whose output is the predicted sale price, trained with L2 loss.


    An Embedding Layer in a Deep Network

    Multiclass Classification to predict a handwritten digit:

A diagram of a deep neural network used to predict handwritten digits. A sparse vector encoding of the raw bitmap of the hand-drawn digit feeds into a hidden three-dimensional embedding layer; the embedding, along with other optional features, feeds into multiple hidden layers, ending in a logit layer that is compared against a sparse "one-hot" target probability distribution over the classes 0-9 via a Softmax loss.


    An Embedding Layer in a Deep Network

    Collaborative Filtering to predict movies to recommend:

A diagram of a deep neural network used to predict which movies to recommend. A sparse vector encoding of a subset of the user's movies (used as input features) feeds into a hidden three-dimensional embedding layer; the embedding, along with other optional features, feeds into multiple hidden layers, ending in a logit layer that is compared against a sparse target probability distribution over another subset of the user's movies (used as "labels") via a Softmax loss.


    Correspondence to Geometric View

    Deep Network

• Each of the hidden units corresponds to a dimension (latent feature)
• Edge weights between a movie and the hidden layer are coordinate values

A tree diagram of a deep neural network with a node in the lowest layer connected to three points in the next higher layer.

    Geometric view of a single movie embedding

A point in 3-dimensional space corresponding to the lower-layer node in the deep neural network diagram.

Selecting How Many Embedding Dims

    • Higher-dimensional embeddings can more accurately represent the relationships between input values
• But more dimensions increase the chance of overfitting and lead to slower training
• Empirical rule of thumb (a good starting point, but it should be tuned using the validation data):
• $$\text{dimensions} \approx \sqrt[4]{\text{possible values}}$$
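For example, under this rule of thumb a movie vocabulary with 1,000,000 possible values suggests starting at roughly \(\sqrt[4]{10^6} \approx 32\) embedding dimensions, and then tuning up or down from there on the validation data.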

    Embeddings as a Tool

• Embeddings map items (e.g. movies, text, ...) to low-dimensional real vectors in such a way that similar items are close to each other
• Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric
• Jointly embedding diverse data types (e.g. text, images, audio, ...) defines a similarity metric between them

    Embeddings: Motivation From Collaborative Filtering

Collaborative filtering is the task of making predictions about the interests of a user based on the interests of many other users. As an example, let's look at the task of movie recommendation. Suppose we have 500,000 users, and a list of the movies each user has watched (from a catalog of 1,000,000 movies). Our goal is to recommend movies to users.

To solve this problem, we need some way to determine which movies are similar to each other. We can achieve this by embedding the movies into a low-dimensional space constructed so that similar movies end up nearby.

    Before describing how we can learn the embedding, we first explore the type of qualities we want the embedding to have, and how we will represent the training data for learning the embedding.

    Arrange Movies on a One-Dimensional Number Line

    To help develop intuition about embeddings, on a piece of paper, try to arrange the following movies on a one-dimensional number line so that the movies nearest each other are the most closely related:

    Movie Rating Description
    Bleu R A French widow grieves the loss of her husband and daughter after they perish in a car accident.
    The Dark Knight Rises PG-13 Batman endeavors to save Gotham City from nuclear annihilation in this sequel to The Dark Knight, set in the DC Comics universe.
Harry Potter and the Sorcerer's Stone PG An orphaned boy discovers he is a wizard and enrolls in Hogwarts School of Witchcraft and Wizardry, where he wages his first battle against the evil Lord Voldemort.
The Incredibles PG A family of superheroes forced to live as civilians in suburbia comes out of retirement to save the superhero race from Syndrome and his killer robot.
Shrek PG A lovable ogre and his donkey sidekick set off on a mission to rescue Princess Fiona, who is imprisoned in her castle by a dragon.
Star Wars PG Luke Skywalker and Han Solo team up with two androids to rescue Princess Leia and save the galaxy.
The Triplets of Belleville PG-13 When professional cyclist Champion is kidnapped during the Tour de France, his grandmother and overweight dog journey overseas to rescue him, with the help of a trio of elderly jazz singers.
    Memento R An amnesiac desperately seeks to solve his wife's murder by tattooing clues onto his body.

    Arrange Movies in a Two-Dimensional Space

    Try the same exercise as before, but this time arrange the same movies in a two-dimensional space.

    Help Center

    Embeddings: Categorical Input Data

    Categorical data refers to input features that represent one or more discrete items from a finite set of choices. For example, it can be the set of movies a user has watched, the set of words in a document, or the occupation of a person.

    Categorical data is most efficiently represented via sparse tensors, which are tensors with very few non-zero elements. For example, if we're building a movie recommendation model, we can assign a unique ID to each possible movie, and then represent each user by a sparse tensor of the movies they have watched, as shown in Figure 3.

    A sample input for our movie recommendation problem.

    Figure 3. Data for our movie recommendation problem.

    Each row of the matrix in Figure 3 is an example capturing a user's movie-viewing history, and is represented as a sparse tensor because each user only watches a small fraction of all possible movies. The last row corresponds to the sparse tensor [1, 3, 999999], using the vocabulary indices shown above the movie icons.
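As a concrete sketch of this idea, here's one way to build that last row as a TensorFlow sparse tensor (the constant names are ours; the catalog size of 1,000,000 simply matches the largest index in Figure 3):

  import tensorflow as tf

  NUM_MOVIES = 1_000_000  # hypothetical catalog size matching index 999999

  # The last row of Figure 3: this user watched movies 1, 3, and 999999.
  watched_ids = [1, 3, 999999]

  # A 1 x NUM_MOVIES sparse tensor that stores only the non-zero entries.
  user_row = tf.sparse.SparseTensor(
      indices=[[0, i] for i in watched_ids],  # (row, column) of each 1
      values=[1.0, 1.0, 1.0],
      dense_shape=[1, NUM_MOVIES])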

Likewise, you can represent words, sentences, and documents as sparse vectors, where each word in the vocabulary plays a role similar to the movies in our recommendation example.

In order to use such representations within a machine learning system, we need a way to represent each sparse vector as a vector of numbers so that semantically similar items (movies or words) end up close together in the vector space. But how do you represent a word as a vector of numbers?

The simplest way is to define a giant input layer with a node for every word in your vocabulary, or at least a node for every word that appears in your data. If 500,000 unique words appear in your data, you could represent a word with a vector of length 500,000 and assign each word to its own slot in the vector.

    If you assign "horse" to index 1247, then to feed "horse" into your network you might copy a 1 into the 1247th input node and 0s into all the rest. This sort of representation is called a one-hot encoding, because only one index has a non-zero value.

    More typically your vector might contain counts of the words in a larger chunk of text. This is known as a "bag of words" representation. In a bag-of-words vector, several of the 500,000 nodes would have non-zero value.
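To make both representations concrete, here's a small numpy sketch (the word indices are the hypothetical ones from above; a real mapping would be built from your data):

  import numpy as np
  from collections import Counter

  VOCAB_SIZE = 500_000
  word_index = {"the": 0, "saw": 88, "horse": 1247}  # hypothetical indices

  # One-hot encoding: a single 1 at the word's index, 0s everywhere else.
  one_hot = np.zeros(VOCAB_SIZE)
  one_hot[word_index["horse"]] = 1.0

  # Bag of words: each slot holds the count of that word in a chunk of text.
  text = ["the", "horse", "saw", "the", "horse"]
  bag_of_words = np.zeros(VOCAB_SIZE)
  for word, count in Counter(text).items():
      bag_of_words[word_index[word]] = count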

    But however you determine the non-zero values, one-node-per-word gives you very sparse input vectors—very large vectors with relatively few non-zero values. Sparse representations have a couple of problems that can make it hard for a model to learn effectively.

    Size of Network

Huge input vectors mean an enormous number of weights for a neural network. If there are M words in your vocabulary and N nodes in the first layer of the network above the input, you have M × N weights to train for that layer alone; a 500,000-word vocabulary feeding a 1,000-node layer already means 500 million weights. So many weights cause further problems: the more weights in your model, the more data you need to train it effectively, and the more computation each training and inference step requires.

    Lack of Meaningful Relations Between Vectors

    If you feed the pixel values of RGB channels into an image classifier, it makes sense to talk about "close" values. Reddish blue is close to pure blue, both semantically and in terms of the geometric distance between vectors. But a vector with a 1 at index 1247 for "horse" is not any closer to a vector with a 1 at index 50,430 for "antelope" than it is to a vector with a 1 at index 238 for "television".

    The Solution: Embeddings

    The solution to these problems is to use embeddings, which translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. We'll explore embeddings intuitively, conceptually, and programmatically in the following sections of this module.

    Help Center

    Embeddings: Translating to a Lower-Dimensional Space

    You can solve the core problems of sparse input data by mapping your high-dimensional data into a lower-dimensional space.

    As you can see from the paper exercises, even a small multi-dimensional space provides the freedom to group semantically similar items together and keep dissimilar items far apart. Position (distance and direction) in the vector space can encode semantics in a good embedding. For example, the following visualizations of real embeddings show geometrical relationships that capture semantic relations like the relation between a country and its capital:

[Figure: three panels of real embedding geometry. Verb tense: swimming → swam parallels walking → walked. Country–capital: Canada → Ottawa, Turkey → Ankara, Russia → Moscow, Spain → Madrid, Italy → Rome, Germany → Berlin, Japan → Tokyo, Vietnam → Hanoi, China → Beijing. Male–female: king → queen parallels man → woman.]

    Figure 4. Embeddings can produce remarkable analogies.

    This sort of meaningful space gives your machine learning system opportunities to detect patterns that may help with the learning task.
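You can reproduce these analogies yourself with pretrained vectors; this sketch assumes the gensim package and one of its downloadable GloVe models:

  import gensim.downloader as api

  # Downloads pretrained 100-dimensional GloVe vectors on first use.
  wv = api.load("glove-wiki-gigaword-100")

  # In a good embedding, vector("king") - vector("man") + vector("woman")
  # lands near vector("queen").
  print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))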

    Shrinking the network

    While we want enough dimensions to encode rich semantic relations, we also want an embedding space that is small enough to allow us to train our system more quickly. A useful embedding may be on the order of hundreds of dimensions. This is likely several orders of magnitude smaller than the size of your vocabulary for a natural language task.

    Embeddings as lookup tables

An embedding is a matrix in which each row is the vector that corresponds to an item in your vocabulary. To get the dense vector for a single vocabulary item, you retrieve the row corresponding to that item.

    But how would you translate a sparse bag of words vector? To get the dense vector for a sparse vector representing multiple vocabulary items (all the words in a sentence or paragraph, for example), you could retrieve the embedding for each individual item and then add them together.

    If the sparse vector contains counts of the vocabulary items, you could multiply each embedding by the count of its corresponding item before adding it to the sum.

    These operations may look familiar.

    Embedding lookup as matrix multiplication

The lookup, multiplication, and addition procedure we've just described is equivalent to matrix multiplication. Given a 1 × N sparse representation S and an N × M embedding table E, the matrix multiplication S × E gives you the 1 × M dense vector.
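A quick numeric check of the equivalence, with toy sizes and a random table:

  import numpy as np

  N, M = 6, 3                        # vocabulary size, embedding dimension
  rng = np.random.default_rng(0)
  E = rng.normal(size=(N, M))        # N x M embedding table; row i is item i

  S = np.array([[0., 2., 0., 1., 0., 0.]])  # 1 x N sparse bag-of-words counts

  # Lookup-and-sum: weight each item's embedding by its count, then add.
  by_lookup = sum(S[0, i] * E[i] for i in range(N) if S[0, i] != 0)

  # A single matrix multiplication gives the same 1 x M dense vector.
  by_matmul = (S @ E)[0]

  assert np.allclose(by_lookup, by_matmul)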

    But how do you get E in the first place? We'll take a look at how to obtain embeddings in the next section.

    Help Center

    Embeddings: Obtaining Embeddings

    There are a number of ways to get an embedding, including a state-of-the-art algorithm created at Google.

    Standard Dimensionality Reduction Techniques

There are many existing mathematical techniques for capturing the important structure of a high-dimensional space in a low-dimensional space. In theory, any of these techniques could be used to create an embedding for a machine learning system.

    For example, principal component analysis (PCA) has been used to create word embeddings. Given a set of instances like bag of words vectors, PCA tries to find highly correlated dimensions that can be collapsed into a single dimension.
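A minimal sketch with scikit-learn, assuming you already have a words-by-contexts count matrix (the random counts below are a stand-in for real data):

  import numpy as np
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(0)
  counts = rng.poisson(1.0, size=(1000, 500))  # 1,000 words x 500 contexts

  # Collapse correlated context dimensions into 50 principal components.
  word_vectors = PCA(n_components=50).fit_transform(counts)
  print(word_vectors.shape)  # (1000, 50): one 50-d vector per word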

    Word2vec

    Word2vec is an algorithm invented at Google for training word embeddings. Word2vec relies on the distributional hypothesis to map semantically similar words to geometrically close embedding vectors.

    The distributional hypothesis states that words which often have the same neighboring words tend to be semantically similar. Both "dog" and "cat" frequently appear close to the word "vet", and this fact reflects their semantic similarity. As the linguist John Firth put it in 1957, "You shall know a word by the company it keeps".

Word2vec exploits contextual information like this by training a neural net to distinguish actually co-occurring groups of words from randomly grouped words. The input layer takes a sparse representation of a target word together with one or more context words. This input connects to a single, smaller hidden layer.

    In one version of the algorithm, the system makes a negative example by substituting a random noise word for the target word. Given the positive example "the plane flies", the system might swap in "jogging" to create the contrasting negative example "the jogging flies".

    The other version of the algorithm creates negative examples by pairing the true target word with randomly chosen context words. So it might take the positive examples (the, plane), (flies, plane) and the negative examples (compiled, plane), (who, plane) and learn to identify which pairs actually appeared together in text.

    The classifier is not the real goal for either version of the system, however. After the model has been trained, you have an embedding. You can use the weights connecting the input layer with the hidden layer to map sparse representations of words to smaller vectors. This embedding can be reused in other classifiers.
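For example, here's a sketch using gensim's word2vec implementation (gensim 4.x parameter names assumed; the tiny corpus is purely illustrative):

  from gensim.models import Word2Vec

  sentences = [
      ["the", "dog", "saw", "the", "vet"],
      ["the", "cat", "saw", "the", "vet"],
      ["the", "plane", "flies"],
  ]

  # Skip-gram (sg=1) with 5 negative samples per positive example.
  model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                   sg=1, negative=5)

  # The learned input-to-hidden weights are the reusable embedding.
  vector_for_dog = model.wv["dog"]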

For more information about word2vec, see the tutorial on tensorflow.org.

    Training an Embedding as Part of a Larger Model

    You can also learn an embedding as part of the neural network for your target task. This approach gets you an embedding well customized for your particular system, but may take longer than training the embedding separately.

    In general, when you have sparse data (or dense data that you'd like to embed), you can create an embedding unit that is just a special type of hidden unit of size d. This embedding layer can be combined with any other features and hidden layers. As in any DNN, the final layer will be the loss that is being optimized. For example, let's say we're performing collaborative filtering, where the goal is to predict a user's interests from the interests of other users. We can model this as a supervised learning problem by randomly setting aside (or holding out) a small number of the movies that the user has watched as the positive labels, and then optimize a softmax loss.

[Figure: a DNN whose input is a sparse vector encoding of the user's movies. The encoding feeds a three-dimensional embedding layer, which, together with other optional features, feeds several hidden layers ending in a logit layer; a softmax loss compares the logits to a sparse target distribution over the held-out user movies.]

    Figure 5. A sample DNN architecture for learning movie embeddings from collaborative filtering data.
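Here's a rough tf.keras sketch of the architecture in Figure 5 (the layer sizes and the pooling choice are our assumptions, not the course's exact setup):

  import tensorflow as tf

  NUM_MOVIES = 1_000_000

  model = tf.keras.Sequential([
      # Learns a 3-d vector per movie; input is a sequence of watched IDs.
      tf.keras.layers.Embedding(input_dim=NUM_MOVIES, output_dim=3),
      tf.keras.layers.GlobalAveragePooling1D(),  # combine the watched movies
      tf.keras.layers.Dense(128, activation="relu"),
      # Logit layer over all movies, trained against held-out movie IDs.
      tf.keras.layers.Dense(NUM_MOVIES),
  ])

  model.compile(
      optimizer="adam",
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))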

As another example, if you want to create an embedding layer for the words in a real-estate ad as part of a DNN to predict housing prices, then you'd optimize an L2 loss using the known sale price of homes in your training data as the label.
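The same pattern with a regression head and an L2 loss might look like this (again a sketch; VOCAB_SIZE and the layer sizes are hypothetical):

  import tensorflow as tf

  VOCAB_SIZE = 50_000  # words that appear in real-estate ads

  price_model = tf.keras.Sequential([
      tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(64, activation="relu"),
      tf.keras.layers.Dense(1),                # predicted sale price
  ])

  price_model.compile(optimizer="adam", loss="mse")  # mse is the L2 loss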

When learning a d-dimensional embedding, each item is mapped to a point in a d-dimensional space such that similar items end up nearby in this space. Figure 6 illustrates the relationship between the weights learned in the embedding layer and the geometric view: the edge weights between an input node and the nodes in the d-dimensional embedding layer correspond to the coordinate values for each of the d axes.

[Figure: the deep network's embedding-layer edge weights shown beside the geometric view of a single movie embedding as a point in space.]

    Figure 6. A geometric view of the embedding layer weights.
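You can see the weight/coordinate correspondence directly by reading an embedding layer's weight matrix (a toy sketch with made-up sizes):

  import tensorflow as tf

  # A tiny 10-movie, 3-d embedding layer.
  layer = tf.keras.layers.Embedding(input_dim=10, output_dim=3)
  _ = layer(tf.constant([4]))  # calling the layer once creates its weights

  # Row 4 of the weight matrix holds the edge weights from input node 4:
  # exactly movie 4's coordinates on the three axes.
  embedding_matrix = layer.get_weights()[0]  # shape: (10, 3)
  print(embedding_matrix[4])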

    Help Center

    Embeddings: Programming Exercise

    In the following exercise, you'll explore embeddings in TensorFlow by building a neural network that will perform sentiment analysis on movie-review data.

    Programming exercises run directly in your browser (no setup required!) using the Colaboratory platform. Colaboratory is supported on most major browsers, and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these instructions for setting up a local environment.

  • Embeddings programming exercise
  • Help Center

    Production ML Systems

    There's a lot more to machine learning than just implementing an ML algorithm. A production ML system involves a significant number of components.

    So far, we've talked about this

[Diagram of a production ML system: the ML code box is dwarfed by nine surrounding components: data collection, data verification, feature extraction, configuration, machine resource management, process management tools, analysis tools, monitoring, and serving infrastructure.]

    But, what about the rest of an ML system?

[The same production ML system diagram as above, emphasizing everything surrounding the ML code.]

    System-Level Components

    • No, you don't have to build everything yourself.
      • Re-use generic ML system components wherever possible.
      • Google CloudML solutions include Dataflow and TF Serving
      • Components can also be found in other platforms like Spark, Hadoop, etc.
      • How do you know what you need?
        • Understand a few ML system paradigms & their requirements

    Video Lecture Summary

    So far, Machine Learning Crash Course has focused on building ML models. However, as the following figure suggests, real-world production ML systems are large ecosystems of which the model is just a single part.

[Diagram of a production ML system: the ML code box is dwarfed by the nine other components: data collection, data verification, feature extraction, configuration, machine resource management, process management tools, analysis tools, monitoring, and serving infrastructure.]

    Figure 1. Real-world production ML system.

The ML code is at the heart of a real-world ML production system, but that box often represents only 5% or less of the system's total code. (That's not a misprint.) Notice that a production ML system devotes considerable resources to input data: collecting it, verifying it, and extracting features from it. Also notice that serving infrastructure must be in place to put the ML model's predictions to practical use in the real world.

Fortunately, many of the components in the preceding figure are reusable, and you don't have to build them all yourself.

TensorFlow provides many of these components, and other options are available from platforms such as Spark or Hadoop.

    Subsequent modules will help guide your design decisions in building a production ML system.

    Help Center

    Static vs. Dynamic Training

    Broadly speaking, there are two ways to train a model:

ML System Paradigms: Training

Static Model -- Trained Offline

• Easy to build and test -- use batch train & test, iterate until good.
• Still requires monitoring of inputs

Dynamic Model -- Trained Online


ML Systems in the Real World: Cancer Prediction

Real World Example: Cancer Prediction

    • Model was trained to predict "probability patient has cancer" from medical records
    • Features included patient age, gender, prior medical conditions, hospital name, vital signs, test results
    • Model gave excellent performance on held-out test data
    • But model performed terribly on new patients -- why?

    Real World Example: Cancer Prediction

    Why do you think the model was unable to perform well on new patients? See if you can figure out the problem, and then click the Play button ▶ below to find out if you're correct.

    * We based this module very loosely (making some modifications along the way) on "Leakage in data mining: formulation, detection, and avoidance" by Kaufman, Rosset, and Perlich.

    Help Center

    ML Systems in the Real World: Literature

    In this lesson, you'll debug a real-world ML problem* related to 18th century literature.

Real World Example: 18th Century Literature

• Professor of 18th Century Literature wanted to predict the political affiliation of authors based only on the "mind metaphors" the authors used.
• Team of researchers made a big labeled data set with many authors' works, sentence by sentence, and split it into train/validation/test sets.
• Trained model did nearly perfectly on test data, but researchers felt the results were suspiciously accurate. What might have gone wrong?

    Real World Example: 18th Century Literature

    Why do you think test accuracy was suspiciously high? See if you can figure out the problem, and then click the Play button ▶ below to find out if you're correct.

    Real World Example: 18th Century Literature

• Data Split A: Researchers put some of each author's examples in the training set, some in the validation set, and some in the test set.

[Figure: under Split A, the training, validation, and test sets each contain examples from Swift, Blake, and Defoe.]

    Real World Example: 18th Century Literature

• Data Split B: Researchers put all of each author's examples in a single set. For example, all of Richardson's examples might be in the training set, while all of Swift's examples might be in the validation set.

[Figure: under Split B, all the Swift examples are in the training set, all the Blake examples are in the validation set, and all the Defoe examples are in the test set.]

    Real World Example: 18th Century Literature

    • Data Split A: Researchers put some of each author's examples in training set, some in validation set, some in test set.
    • Data Split B: Researchers put all of each author's examples in a single set.
• Results: The model trained on Data Split A had much higher accuracy than the model trained on Data Split B, because with Split A the model could learn to recognize each author's idiosyncratic language rather than anything about political affiliation.

    Real World Example: 18th Century Literature

    The moral: carefully consider how you split examples.

    Know what the data represents.

    * We based this module very loosely (making some modifications along the way) on "Meaning and Mining: the Impact of Implicit Assumptions in Data Mining for the Humanities" by Sculley and Pasanek.

    Help Center

    ML Systems in the Real World

    This lesson summarizes the guidelines learned from these real-world examples.

Some Effective ML Guidelines

• Keep the first model simple
• Focus on ensuring data pipeline correctness
• Use a simple, observable metric for training & evaluation
• Own and monitor your input features
• Treat your model configuration as code: review it, check it in
• Write down the results of all experiments, especially "failures"

    Video Lecture Summary

The list above is a quick synopsis of effective ML guidelines.

    Other Resources

    Rules of Machine Learning contains additional guidance.

    Help Center

    Next Steps

To continue developing your TensorFlow skills, check out the following resources:

    Machine Learning Practica

    Check out these real-world case studies of how Google uses machine learning in its products, with video and hands-on coding exercises:

    Other Machine Learning Resources

    TensorFlow

    Join a Kaggle Competition

    Ready to apply your new ML skills to a real-world data-science challenge? Try your hand at one of the many competitions on Kaggle!

    Try a Competition!